Train a model using multiple GPUs

Issue Description

I tried to train my model on multiple GPUs. However, when I launch the program, it hangs in the first iteration. Using nvidia-smi, I find that hundreds of MB of memory are consumed on each GPU. I guess this memory usage comes from model initialization on each GPU.

I am sharing 8 GPUs with others on the server, so I limit my program to GPU 2 and GPU 3 with the following command:

os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'
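
Note for readers, anticipating the solution below: this assignment only takes effect if it runs before CUDA is initialized. A minimal sketch of the safe ordering, with a dummy tensor standing in for real work:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'  # must run before the first CUDA call

import torch
x = torch.randn(2, 2).cuda()  # lands on physical GPU 2, re-indexed as cuda:0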

Code example

I ran this official tutorial on my machine, and the same thing happened: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
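
For context, the core of that tutorial is just wrapping the model in nn.DataParallel; a minimal sketch of the pattern it uses:

import torch
import torch.nn as nn

# Replicate the model across all visible GPUs; each forward call
# splits the input batch along dimension 0 across the replicas.
model = nn.Linear(5, 2)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.cuda()

x = torch.randn(30, 5).cuda()
out = model(x)  # with 2 visible GPUs, each replica processes 15 samples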

System Info

PyTorch version: 0.4.0
Is debug build: No
CUDA used to build PyTorch: 8.0.61

OS: Ubuntu 16.04.3 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 7.5.17
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX TITAN X
GPU 2: GeForce GTX TITAN X
GPU 3: GeForce GTX TITAN X
GPU 4: GeForce GTX TITAN X
GPU 5: GeForce GTX TITAN X
GPU 6: GeForce GTX TITAN X
GPU 7: GeForce GTX TITAN X

Nvidia driver version: 375.88
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.5.1.10
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.20
/usr/lib/x86_64-linux-gnu/libcudnn_static_v5.a
/usr/lib/x86_64-linux-gnu/libcudnn_static_v6.a
/usr/local/lib/python2.7/dist-packages/torch/lib/libcudnn.so.6

Versions of relevant libraries:
[pip3] msgpack-numpy (0.4.1)
[pip3] numpy (1.14.0)
[pip3] numpydoc (0.7.0)
[pip3] torch (0.4.0)
[pip3] torchtext (0.2.3)
[pip3] torchvision (0.2.1)
[conda] torch 0.4.0
[conda] torchtext 0.2.3
[conda] torchvision 0.2.1

Solution for those who face a similar problem

Is the memory used on all GPUs or just the two you’ve selected?
Sometimes it can be problematic to set the environment variable inside the Python script, e.g. when CUDA has already been initialized. The workaround is to set the available GPUs before calling the script:

CUDA_VISIBLE_DEVICES=2,3 python script.py
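
You could also verify inside the script that the restriction took effect:

import torch
print(torch.cuda.device_count())      # should print 2
print(torch.cuda.get_device_name(0))  # physical GPU 2, re-indexed as cuda:0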

How did you check that the data is not being fed into the model?

@ptrblck
Thank you for your reply.

I run my program only on the two selected GPUs.

I’ve tried your suggestion, but nothing changed.

Honestly, I didn’t explicitly check whether the data is fed into the model, but the for-loop used for training simply doesn’t iterate, so I guess the data is not being fed into the model.

The for-loop is not iterating: after consuming hundreds of MB of GPU memory, the program seems to be frozen.

I found another piece of information a few minutes ago that might be helpful.
I know it’s 2018, but the lab server is still on CUDA 7.5, while I’m using PyTorch 0.4.0 compiled with CUDA 8.0.
Please don’t ask me to switch back to PyTorch 0.3.0 with CUDA 7.5; I’ve tried, and there are a lot of horrible compatibility issues.

No, I wouldn’t switch back to an older version.
I think you should be fine, since the binaries ship with their own CUDA libs.
Could you check it with print(torch.version.cuda)?

What do you mean by “the for-loop is not iterating”?
Does your code just exit or does it hang in the loop?

Are you able to get a single sample from your DataLoader?

loader_iter = iter(loader)
data, target = next(loader_iter)

@ptrblck
My dataloaders work very well on a single GPU, so I guess there is nothing wrong with them.

My code hangs in the first iteration. Here is part of my code.
In the first iteration, "First checkpoint" is printed out but "Second checkpoint" is not.
The whole program hangs there.

for ids, Cw, Cc, Qw, Qc, a in tqdm.tqdm(train_loader):
    optimizer.zero_grad()
    print("First checkpoint")             # this is printed
    pred1, pred2 = model(Cw, Cc, Qw, Qc)  # hangs inside this forward pass
    print("Second checkpoint")            # this is never reached
    torch.cuda.empty_cache()
    loss1 = F.cross_entropy(pred1, a[:, 0])
    loss2 = F.cross_entropy(pred2, a[:, 1])
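
A note for anyone debugging a similar hang: a tiny DataParallel forward pass, like the sketch below, isolates whether the multi-GPU communication itself is broken, independently of the model and the data.

import torch
import torch.nn as nn

# If even this minimal forward pass hangs, the problem is GPU-to-GPU
# communication (e.g. a broken peer-to-peer setup), not the model or data.
model = nn.DataParallel(nn.Linear(10, 10).cuda())
x = torch.randn(8, 10).cuda()
print(model(x).shape)  # expected: torch.Size([8, 10])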

This is the CUDA version information.

>>> print(torch.version.cuda)
8.0.61

I’m not a native English speaker.
I hope my description is clear enough.

Thank you.

Have a look at this suggestion from @ngimel. The thread deals with this issue.

I ran the following command, and nothing was printed to the screen.
Does this mean my machine passes the test?

nvcc p2pBandwidthLatencyTest.cu
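
For readers: nvcc on its own only compiles the sample, so empty output is expected at this step; the resulting binary still has to be executed for the test to actually run, e.g.:

nvcc p2pBandwidthLatencyTest.cu -o p2pBandwidthLatencyTest
./p2pBandwidthLatencyTest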

It seems that after this step, a reboot is required. Well, I need to talk to the administrator next week.

Thank you for your help. I will add your suggestion to the main post.

Thanks again.
