Hi, thanks for your solution, but when I do this I get an error in my loss function:
buffer[torch.eq(target, -1.)] = 0
RuntimeError: invalid argument 2: sizes do not match at /opt/conda/conda-bld/pytorch_1512946747676/work/torch/lib/THC/generated/…/generic/THCTensorMasked.cu:13
This is not an error in my code itself, but one that appears after enabling data parallelism (I ran a less intensive version of my code both with and without data parallelism, and it only throws this error with data parallelism).
My model is memory-intensive and I have 2 GPUs with 12206 MiB each. I just need to split my model across both GPUs during training as well as testing.
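As a side note on the error above: the masked assignment only works when the boolean mask and the indexed tensor have exactly the same shape, and `nn.DataParallel` splits the batch across replicas, which can break that assumption. A minimal sketch with hypothetical shapes (stand-ins, not your actual tensors):

```python
import torch

# Stand-ins for the real tensors: under nn.DataParallel the batch is
# split across replicas, so `target` and `buffer` must have the same
# shape on every device for this masked assignment to be valid.
buffer = torch.randn(4, 3)
target = torch.full((4, 3), -1.0)
target[0, 0] = 1.0

mask = torch.eq(target, -1.0)    # same shape as `buffer`, so indexing works
assert mask.shape == buffer.shape
buffer[mask] = 0.0
```

If `target` has an extra (or missing) dimension compared to `buffer`, reshaping it (e.g. with `target.view_as(buffer)`) before building the mask avoids the size mismatch.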
Not necessarily, but it depends on your use case of course.
You can pass all model parameters to a single optimizer, define per-parameter options, or use different optimizers. The split into submodules does not limit your options, and you can handle the submodules like single layers.
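To make these options concrete, here is a minimal sketch with a hypothetical two-submodule model, showing both a single optimizer with per-parameter-group options and separate optimizers per submodule:

```python
import torch
import torch.nn as nn

# Hypothetical model split into two submodules.
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 2))
sub_a, sub_b = model[0], model[1]

# Option 1: a single optimizer with per-parameter-group options.
opt = torch.optim.SGD([
    {"params": sub_a.parameters(), "lr": 1e-2},
    {"params": sub_b.parameters(), "lr": 1e-3},
])

# Option 2: a separate optimizer per submodule.
opt_a = torch.optim.SGD(sub_a.parameters(), lr=1e-2)
opt_b = torch.optim.Adam(sub_b.parameters(), lr=1e-3)
```

With option 1 you call `opt.step()` once; with option 2 you call `opt_a.step()` and `opt_b.step()` after the backward pass.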
However, my question is not about the unbalanced GPU usage; it is about the GPU memory that is allocated.
I notice that when I split the whole model across 4 GPUs and run forward/backward, the first GPU uses much more memory than it should. For example, if the whole model uses 12 GB on a single GPU, after splitting it across four GPUs the first GPU uses 11 GB while the other three together use about 11 GB.
Is there an explanation of how GPU memory is allocated when using multiple GPUs for model parallelism?
I still don't know how to deal with this problem; I hope you can help me!
Thanks for the clarification. I now understand that you are sharding your model, i.e. some layers are on GPU0, the next ones on GPU1, etc. With this approach, the total memory usage across all GPUs is approx. 2x the memory usage of the same model on a single GPU with the same setup (e.g. the same batch size).
If I understood your issue correctly, could you post a (small) reproducible code snippet so that we could have a look?
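In the meantime, one way to see where the memory actually goes is to query PyTorch's per-device counters; a minimal sketch:

```python
import torch

# Print allocated vs. reserved memory for each visible CUDA device.
# "allocated" is memory held by live tensors; "reserved" includes the
# caching allocator's pool, so it is usually larger.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024**2
        reserved = torch.cuda.memory_reserved(i) / 1024**2
        print(f"cuda:{i}: allocated {alloc:.0f} MiB, reserved {reserved:.0f} MiB")
else:
    print("no CUDA devices available")
```

Calling this right after the forward pass and again after the backward pass shows how activations and gradients are distributed across the devices.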
Yeah, I just realized that I misunderstood @Ashwin_Raju, so thanks for pointing it out.
Model sharding or model parallelism refers to splitting the model across several GPUs, so the forward and backward passes go through all devices.
You could use it, e.g., if you have a huge model and multiple GPUs available.
A simple dummy case is given in this previous post.
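To illustrate the idea, here is a toy sketch (hypothetical layer sizes) where each stage lives on its own device and the intermediate activation is moved between them; it falls back to CPU so it also runs without two GPUs:

```python
import torch
import torch.nn as nn

class ShardedModel(nn.Module):
    """Toy model sharding: each stage lives on its own device."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Linear(10, 10).to(dev0)
        self.part2 = nn.Linear(10, 2).to(dev1)

    def forward(self, x):
        x = torch.relu(self.part1(x.to(self.dev0)))
        # Move the intermediate activation to the second device.
        return self.part2(x.to(self.dev1))

# Use two GPUs if available, otherwise fall back to CPU for both stages.
two_gpus = torch.cuda.device_count() >= 2
dev0 = "cuda:0" if two_gpus else "cpu"
dev1 = "cuda:1" if two_gpus else "cpu"

model = ShardedModel(dev0, dev1)
out = model(torch.randn(8, 10))
```

Autograd handles the cross-device transfers, so `out.sum().backward()` works as usual and the gradients end up on the same devices as their parameters.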
I wonder how everything you discussed regarding model sharding and data parallelism is affected when using NVLink (for example, connecting two GPUs with an NVLink bridge). Will it just make things faster, or will it open up new scenarios?