DataParallel causing zombie processes and filled GPU memory

Hello, I am trying to implement multi-GPU training and testing, and I am running it through a Jupyter notebook.
This may be a driver-specific issue on my local machine, as I don't seem to have this problem when running it on AWS p2.8xlarge instances.

EDIT: I just found another case, on the AWS instance, where the GPU memory does not get freed. Is there any way to force the GPU memory to be released from within Python itself?

I also noticed that when I load the same dataset, using the same set of GPUs, from two different processes, they both become zombie processes.

I make the model parallel with

model = nn.DataParallel(model, device_ids=[0, 1]).cuda()

...

and then move the input and labels to the GPU with
input = input.cuda()
label = label.cuda()
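
For reference, a stripped-down, self-contained version of this pattern looks roughly like the following (the random input and label tensors here are just dummy stand-ins for the batches my DataLoader produces):

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=False)
model.fc = nn.Linear(model.fc.in_features, 7)
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()  # replicas on GPUs 0 and 1

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# dummy batch; DataParallel splits it along dim 0 across the two GPUs
input = torch.randn(120, 3, 224, 224).cuda()
label = torch.randint(0, 7, (120,)).cuda()

optimizer.zero_grad()
output = model(input)  # outputs are gathered back onto device_ids[0]
loss = criterion(output, label)
loss.backward()
optimizer.step()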

But when I interrupt the process and restart it, the GPU memory does not seem to be freed: nvidia-smi shows that portions of the GPU are still occupied, and running the process again returns an out-of-memory error.

The entire code can be found here: https://github.com/gtm2122/viewclassification/blob/master/view_testls/nndev.py

The relevant lines that use DataParallel and send data to CUDA are lines 264 and 292 in model_pip.train_model(),

and in model_pip.test() they are lines 405 and 415.

From my Jupyter notebook, this is my usage:

res = models.resnet18(pretrained=False)
res.fc = nn.Linear(res.fc.in_features, 7)
obj = model_pip(model_in=res, scale=True, batch_size=120, use_gpu=True, gpu=[0, 1], data_path=datapath, lr=0.1, lr_decay_epoch=30)
model = obj.train_model(epochs=50)
obj.store_model(f_name=savepath)
del obj
res = models.resnet18(pretrained=False)
res.fc = nn.Linear(res.fc.in_features, 7)

obj = model_pip(model_in=res, scale=True, batch_size=120, use_gpu=True, gpu=[0, 1], data_path=datapath, lr=0.1, lr_decay_epoch=30)
obj.load_model(filename=savepath)
obj.test(...)
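
Related to my edit above: between training and testing the only cleanup I do is del obj. Is something more explicit, along the lines of the snippet below, the intended way to release GPU memory from within Python? (free_gpu_memory is just a name I made up here, and as far as I understand torch.cuda.empty_cache() only returns PyTorch's cached blocks to the driver; it cannot reclaim memory that is still referenced, or memory held by another, possibly zombie, process.)

import gc
import torch

def free_gpu_memory():
    # Drop Python-side references first (e.g. del obj), then collect
    # unreachable tensors and release PyTorch's cached GPU blocks.
    gc.collect()
    torch.cuda.empty_cache()

del obj
free_gpu_memory()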

If you are using an older version of PyTorch, I think I remember there being an issue with this, so upgrading to the newest version of PyTorch may help. If that doesn't help, or in addition to it, make sure you end training with Ctrl+C when you interrupt it; that should kill the processes properly and free up the GPU memory.
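
If you are interrupting cells inside Jupyter, it can also help to catch the interrupt yourself, because an unhandled exception in IPython keeps the last traceback around (for %debug / %tb), and that traceback can keep GPU tensors from the interrupted loop alive. A rough, untested sketch of the idea, where train_one_run stands in for your obj.train_model(...) call:

import gc
import torch

def train_one_run(model, loader, criterion, optimizer):
    # Stand-in for obj.train_model(...): the usual forward/backward loop.
    for input, label in loader:
        input = input.cuda()
        label = label.cuda()
        optimizer.zero_grad()
        loss = criterion(model(input), label)
        loss.backward()
        optimizer.step()

try:
    train_one_run(model, loader, criterion, optimizer)
except KeyboardInterrupt:
    # Ctrl+C / Jupyter's "interrupt kernel" lands here; the loop's locals
    # (batches, loss) go out of scope instead of being kept alive by a
    # stored traceback.
    print('training interrupted')
finally:
    gc.collect()              # collect tensors that are now unreachable
    torch.cuda.empty_cache()  # hand cached GPU blocks back to the driver
    # Note: this cannot reclaim memory held by a crashed/zombie DataLoader
    # worker process; those still have to be killed, or the kernel restarted.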