Hello, I am trying to implement multi-GPU training and testing, and I am running it from a Jupyter notebook.
This may be a driver-specific issue on my local machine, as I don't seem to have this problem when running on AWS p2.8xlarge instances.
EDIT: I just found another case on the AWS instance where the GPU memory does not get freed. Is there any way to force the GPU memory to be freed from within Python itself?
I also noticed that when two different processes load from the same dataset using the same set of GPUs, they both become zombie processes.
I make the model parallel by doing
and then pass the input and label to cuda as
input = input.cuda()
But when I interrupt the process and restart it, the GPU memory does not seem to be freed: nvidia-smi shows that portions of the GPU memory are still occupied, and running the process again returns an out-of-memory error.
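To see which processes still hold GPU memory after an interrupt, the nvidia-smi check can be scripted. This is only a rough sketch: the helper names parse_compute_apps and gpu_processes are made up for illustration, and it assumes nvidia-smi is on the PATH and supports the --query-compute-apps option.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into (pid, MiB or None) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip anything that is not a 'pid, memory' row.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        # Memory can be reported as a non-numeric placeholder on some setups.
        mem = int(parts[1]) if parts[1].isdigit() else None
        rows.append((int(parts[0]), mem))
    return rows


def gpu_processes():
    """Return (pid, used MiB) for processes using the GPUs, or [] if unknown."""
    if shutil.which("nvidia-smi") is None:
        return []  # no NVIDIA tools on this machine
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    for pid, mem in gpu_processes():
        print(pid, mem)
```

If a PID listed here belongs to the interrupted run, the memory is being held by a process that is still alive rather than by the new one, so it probably has to be ended (for example with kill from a shell) before the memory is returned.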
The entire code can be found here: https://github.com/gtm2122/viewclassification/blob/master/view_testls/nndev.py
The relevant lines that use DataParallel and send the data to cuda are line 264 and line 292 in model_pip.train_model(),
and in model_pip.test() they are lines 405 and 415.
This is my usage from my Jupyter notebook:
res = models.resnet18(pretrained=False)
res.fc = nn.Linear(res.fc.in_features,7)
obj = model_pip(model_in=res,scale=True,batch_size=120,use_gpu=True,gpu=[0,1],data_path=datapath,lr=0.1,lr_decay_epoch=30)
model = obj.train_model(epochs=50)
obj.store_model(f_name=savepath)
del(obj)
res = models.resnet18(pretrained=False)
res.fc = nn.Linear(res.fc.in_features,7)
obj = model_pip(model_in=res,scale=True,batch_size=120,use_gpu=True,gpu=[0,1],data_path=datapath,lr=0.1,lr_decay_epoch=30)
obj.load_model(filename=savepath)
obj.test(...)
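For the del(obj) step between training and testing, this is the kind of in-Python cleanup that could be tried. It is a minimal sketch, not a confirmed fix: release_gpu_memory is a hypothetical helper name, and it assumes a PyTorch version that provides torch.cuda.empty_cache(). It can only release cached blocks that no live tensor in the current process still references, so it would not help with memory held by another (for example zombie) process.

```python
import gc


def release_gpu_memory():
    """Try to hand cached GPU memory back to the driver.

    Returns True if the CUDA cache was emptied, False if torch or CUDA
    is not available in this environment.
    """
    # Collect unreachable Python objects first so that the tensors they
    # own are actually released before the cache is emptied.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Frees unused blocks held by PyTorch's caching allocator; memory still
    # referenced by live tensors is not affected.
    torch.cuda.empty_cache()
    return True


if __name__ == "__main__":
    print(release_gpu_memory())
```

In the notebook this would be called right after del(obj). Any other names still pointing at the model or its outputs (such as model returned by train_model) would need to be deleted as well, otherwise their memory stays allocated.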