DataParallel uses a bit more memory on the default device, which is GPU0 unless you change it. If you are using this GPU for other processes, e.g. your desktop, you can change the order of the device ids, e.g. device_ids=[1, 0].
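For reference, a minimal sketch of swapping the device order (the model and tensor sizes here are made up for illustration; the CUDA branch is guarded so the snippet also runs on a CPU-only machine):

```python
import torch
import torch.nn as nn

# hypothetical tiny model, just to illustrate the wrapping
model = nn.Linear(10, 2)
x = torch.randn(8, 10)

if torch.cuda.device_count() > 1:
    # GPU1 becomes the "primary" replica, so GPU0 keeps more
    # memory free for other processes (e.g. the desktop)
    model = nn.DataParallel(model, device_ids=[1, 0])
    model.cuda(1)  # parameters must live on device_ids[0]
    x = x.cuda(1)

out = model(x)  # scattered across GPUs if available, else plain forward
```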
The code at https://ideone.com/gJVwSk works fine for 1 GPU. Can you have a look and suggest what changes to make? https://ideone.com/YyBOa0 is the version I changed after your suggestion… Please point out where I went wrong…
Remove torch.cuda.set_device(gpu) and try to use DataParallel again.
Also, could you delete loaded_model? It seems to use some GPU memory without being used.
Yeah, I ran it after removing torch.cuda.set_device(gpu). It runs, but uses only 1 GPU. What I would like to do is run the first batch on one GPU and the second batch on another GPU, then merge the results together. Am I supposed to do that manually?
Even training for 8 epochs took 24 hours with 1 GPU. I would like to speed this up.
Since the project contains a lot of files, creating a small snippet seems difficult to me. Can you suggest how to debug the code for multi-GPU support?
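One common way to debug this outside the full project is a tiny self-contained probe (the Probe module below is hypothetical, not taken from the project): each DataParallel replica prints the batch slice it receives, so you can see directly whether the batch is being split across devices. The CUDA path is guarded, so the snippet also runs on a CPU-only box (where it simply reports the full batch):

```python
import torch
import torch.nn as nn

class Probe(nn.Module):
    """Tiny module that reports the per-replica batch size it sees."""
    def forward(self, x):
        # with DataParallel on 2 GPUs, each replica should see
        # roughly half of the full batch along dim 0
        print(f'device={x.device}, batch={x.size(0)}')
        return x * 2

model = Probe()
x = torch.randn(16, 4)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()
    x = x.cuda()

out = model(x)  # gathered back to a single tensor on the primary device
```

If each replica prints a smaller batch on a different device, DataParallel is working; if only one device ever prints, the wrapping or device setup is the problem.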
Can you think of any other reason why DataParallel is not working, based on your experience with PyTorch and CUDA?
@ptrblck Thanks a lot for taking time to help me out
Could you walk me through the code a bit, so that it doesn’t take that much time to read all the functions?
First I suppose I have to run gte_vae_pretrain.py and then just gte.py?