If I want to use multiple GPUs for a network, should I specifically write my network so that it is designed to be trained on multiple GPUs, or can I just add some command to switch between one GPU and multiple GPUs?
You can just wrap your model in DataParallel and specify the device_ids you would like to use.
The data will be split along the batch dimension. See the data parallel tutorial for more information.
If your system has >=5 GPUs, your code should work.
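A minimal sketch of that switch (the nn.Linear is just a stand-in for your network):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for your network

if torch.cuda.device_count() > 1:
    # replicate the model onto the listed GPUs; inputs are split along dim 0
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
model = model.to('cuda')

x = torch.randn(8, 10).to('cuda')  # batch of 8 -> 2 samples per GPU
out = model(x)                     # output is gathered on device_ids[0]
```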
Could you print gpu_ids before passing it to nn.DataParallel?
Also, what does nvidia-smi show?
Do you see any utilization on the four GPUs?
Are you passing a batch with more than 5 samples to the model?
Note that each chunk of the batch will be sent to a different GPU, so you should pass at least one sample per GPU.
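A quick way to verify the chunking is a throwaway module that prints what each replica receives; a sketch:

```python
import torch
import torch.nn as nn

class Probe(nn.Module):
    def forward(self, x):
        # each replica prints the chunk of the batch it received
        print(f'{x.device}: input of shape {tuple(x.shape)}')
        return x

n = torch.cuda.device_count()
model = nn.DataParallel(Probe(), device_ids=list(range(n))).to('cuda')
_ = model(torch.randn(2 * n, 16).to('cuda'))  # expect 2 samples per GPU
```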
In the tutorial, it mentions nothing about training (i.e., no loss function, loss.backward(), or optimizer.step()).
When the model is converted to a DataParallel model, does backprop get seamlessly handled behind the scenes? I guess the same gradients would be passed to all instances of the model across GPUs, is that right? Or is there another mechanism to synchronise the model instances across GPUs?
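For reference, I am assuming the training loop itself stays the same as in the single-GPU case, i.e. roughly:

```python
import torch
import torch.nn as nn

net = nn.Linear(20, 5)                        # stand-in for the real model
model = nn.DataParallel(net).to('cuda')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# dummy batch standing in for a real DataLoader
loader = [(torch.randn(8, 20), torch.randint(0, 5, (8,)))]

for inputs, targets in loader:
    inputs, targets = inputs.to('cuda'), targets.to('cuda')
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)  # forward runs on all GPUs
    loss.backward()                           # per-replica gradients are reduced
                                              # onto the base module on GPU 0
    optimizer.step()                          # one update; replicas are rebuilt
                                              # on the next forward pass
```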
When using DataParallel, is there a way to set a maximum allocated memory per GPU?
When running my custom loss, additional memory gets allocated during the forward pass, and it is NOT only a few hundred megabytes. DataParallel is leaving some additional space in my GPU memory, but it's not enough to run the loss's forward pass.
Other than that, how does DataParallel decide how to split data?
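My (possibly wrong) mental model is that it simply chunks the input along dim 0, roughly like:

```python
import torch

batch = torch.randn(10, 3, 224, 224)
# with 4 GPUs, the scatter behaves roughly like torch.chunk on dim 0
chunks = torch.chunk(batch, 4, dim=0)
print([c.shape[0] for c in chunks])  # [3, 3, 3, 1]
```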
Not that I'm aware of. Did the workaround mentioned in the blog post not work for you?
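As I understand the blog post, the idea is to move the loss computation into the model's forward so that it is parallelized as well; a minimal sketch of that pattern (the class name is mine):

```python
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Compute the loss inside forward so each replica keeps its own
    share of the loss activations instead of piling them onto GPU 0."""
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, x, target):
        return self.criterion(self.model(x), target)

# wrapped = nn.DataParallel(ModelWithLoss(model, criterion)).to('cuda')
# loss = wrapped(inputs, targets).mean()  # one partial loss per replica
```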
I implemented everything as mentioned in the blog post, but GPU #1 still runs out of memory when I increase the batch_size. I notice a significant increase in memory inside the loss's forward method.
So I think initialization is fine, but at runtime it simply uses more memory with every computation added to the graph inside the loss's forward method.
DataParallel would have to have a mechanism to predict how much memory the model as well as the loss (criterion) will use, right? So I should either expose that information to DataParallelCriterion when initializing it or change my loss function.
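In the meantime I am measuring what each device actually uses with something like:

```python
import torch

for i in range(torch.cuda.device_count()):
    alloc = torch.cuda.memory_allocated(i) / 1024**2
    peak = torch.cuda.max_memory_allocated(i) / 1024**2
    print(f'GPU {i}: {alloc:.0f} MB allocated, {peak:.0f} MB peak')
```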
I am trying to parallelize an existing model (a Transformer) on multiple GPUs. I have an NVIDIA DGX-1 with 8 Volta GPUs.
These are the issues/questions I ran into:
- It is not clear which value I should set for model.to(device); I am currently trying simply .to('cuda') (see the sketch after this list).
- The model uses a lambda LR scheduler, but with more than one GPU it exits with the execution error "UnboundLocalError: local variable 'values' referenced before assignment".
- How can I effectively check tensor boundary limits? I am getting the impression that moving a model from one to multiple GPUs requires a profound refactoring of the code; it is extremely difficult to understand where indices can go out of bounds.
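For the first issue, what I currently have (not sure it is correct) is along these lines:

```python
import torch
import torch.nn as nn

model = nn.Transformer()                 # stand-in for my actual model
device = torch.device('cuda:0')          # DataParallel gathers outputs
                                         # on device_ids[0]
model = nn.DataParallel(model, device_ids=list(range(8))).to(device)
```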