Hi, I am trying to use multiple GPUs while running my code. I tried the code provided in the PyTorch documentation, but it's not working.
Could anyone please take a look at this?
The thing is, I was able to run the program on multiple GPUs across multiple nodes using DistributedDataParallel. But the gradients don't seem to be collected, since the accuracy and loss are all zero after the first epoch.
Hi @shrutishrestha, I looked into your code. If you want to use the NCCL process group, you should set the CUDA device of each process to its local rank via torch.cuda.set_device.
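A minimal sketch of what that setup can look like when launched with torchrun (which exports `LOCAL_RANK` for each process); the function names here are illustrative, not from the original code:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed() -> int:
    """Pin this process to one GPU and join the NCCL process group."""
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
    local_rank = int(os.environ["LOCAL_RANK"])
    # Call set_device *before* init_process_group so that NCCL collectives
    # and any plain .cuda() calls in this process target the right GPU.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return local_rank

def wrap_model(model: torch.nn.Module, local_rank: int) -> DDP:
    """Move the model to this process's GPU and wrap it in DDP."""
    model = model.cuda(local_rank)
    # device_ids must match the device the process was pinned to,
    # otherwise gradient all-reduce can silently misbehave.
    return DDP(model, device_ids=[local_rank])

if __name__ == "__main__":
    local_rank = setup_distributed()
    model = torch.nn.Linear(10, 2)
    ddp_model = wrap_model(model, local_rank)
    # ... training loop here ...
    dist.destroy_process_group()
```

Without the set_device call, every process defaults to cuda:0, which is a common cause of gradients not syncing correctly across ranks.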