Program fails cuda-memcheck

neilmehta87 · June 19, 2020, 12:36am

Hi,

I am new to PyTorch and I am trying to write a simple NN code capable of running on a GPU. I have attached the code (part of a larger code) which I have written to demonstrate my problem. I am able to compile and run my code without any errors. However, when I try to run it using cuda-memcheck, I get a bunch of errors, which all state:
Program hit cudaErrorCudartUnloading (error 4) due to “driver shutting down” on CUDA API call to cudaFree/cudaDeviceSynchronize/cudaEventDestroy etc.
I am using pytorchv1.5.0-gpu, gcc/8.3.0, cuda 10.2.89, python-3.7-anaconda, and running my code on a Volta V100.

My code is below:

import torch
import numpy as np

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in,H)
        self.linear2 = torch.nn.Linear(H,D_out)

    def forward(self, x): 
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

def run_NN():
    N, D_in, H, D_out, tstep = 64, 1000, 100, 10, 1000
    print("*********************************************")
    print("Start python")

    dtype = torch.float
    device = torch.device("cuda:0")

    x = torch.randn(N, D_in, dtype=dtype, device=device)
    y = torch.randn(N, D_out, dtype=dtype, device=device)
       
    w1 = torch.randn(D_in, H, device=device, dtype=dtype)
    w2 = torch.randn(H, D_out, device=device, dtype=dtype)  

    model = TwoLayerNet(D_in, H, D_out)
    y_pred = torch.matmul(x,w1)

    print("End python")
    print("*********************************************")

run_NN()

Could someone please help me?
Thank you

Edit: I tried cuda-memcheck on the examples listed at Simple NN examples Pytorch and I get the same errors.
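
Since every message in the post has the same shape, one quick way to see whether anything other than the shutdown-time error is being reported is to tally the messages by error name and API call. The following is a minimal sketch, not part of the original post: the function name `summarize` and the sample log lines are hypothetical, and the regular expression is built only from the message quoted above, so the exact layout of real cuda-memcheck output (line prefixes, quote characters) is an assumption.

```python
import re
from collections import Counter

# Pattern derived from the message quoted in the post. Real cuda-memcheck
# output may add prefixes or use different quote characters (assumption).
HIT_RE = re.compile(
    r"Program hit (?P<error>\w+) \(error (?P<code>\d+)\).*?"
    r"on CUDA API call to (?P<api>\w+)"
)


def summarize(log_text):
    """Split reported API errors into shutdown-time ones and everything else.

    Returns (shutdown, other): a Counter of API calls that failed with
    cudaErrorCudartUnloading, and a list of the remaining error lines.
    """
    shutdown = Counter()
    other = []
    for line in log_text.splitlines():
        match = HIT_RE.search(line)
        if match is None:
            continue  # not an API-error line
        if match.group("error") == "cudaErrorCudartUnloading":
            shutdown[match.group("api")] += 1
        else:
            other.append(line.strip())
    return shutdown, other


# Hypothetical sample lines, modeled on the message quoted in the post.
sample = """\
Program hit cudaErrorCudartUnloading (error 4) due to "driver shutting down" on CUDA API call to cudaFree.
Program hit cudaErrorCudartUnloading (error 4) due to "driver shutting down" on CUDA API call to cudaEventDestroy.
Program hit cudaErrorCudartUnloading (error 4) due to "driver shutting down" on CUDA API call to cudaFree.
"""

shutdown, other = summarize(sample)
print("shutdown-time API errors:", dict(shutdown))
print("other API errors:", other)
```

If the second list stays empty on a real log, all reported API errors are of the "driver shutting down" kind; that alone does not say whether they are harmless.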

ptrblck · June 19, 2020, 9:00am

The error is raised because cuda-memcheck itself is encountering an error, not because it found an issue in your code.
I would recommend reinstalling CUDA and trying a fresh installation of cuda-memcheck.

neilmehta87 · June 19, 2020, 5:50pm

Hi ptrblck,
Thank you for responding.
I am using Cori at NERSC, where CUDA is installed centrally on the system. The installed CUDA version is used and tested by others without issue, and cuda-memcheck has been used successfully on multiple other codes.
Could my issue be related to the response posted by @ezyang in https://github.com/pytorch/pytorch/issues/11858#issuecomment-475359500?

I also tried a binary install of PyTorch inside a conda environment, and the cuda-memcheck errors still persist.

Please let me know if I should provide any other information.
Thank you!
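
One experiment that might help narrow this down is to drop every GPU tensor explicitly and synchronize before the interpreter exits, then rerun the script under cuda-memcheck and compare the output. This is only a sketch under the assumption (not confirmed anywhere in this thread) that the reported calls happen during teardown at exit. The function name `run_and_release` is hypothetical, and the CPU fallback and the guarded import exist only so the snippet can run on a machine without a GPU or without PyTorch.

```python
import gc

try:
    import torch
except ImportError:  # allows the sketch to load where PyTorch is missing
    torch = None


def run_and_release():
    """Run a small matmul, then release GPU resources before returning."""
    if torch is None:
        return None

    # Fall back to the CPU when no GPU is visible (illustration only).
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    x = torch.randn(64, 1000, device=device)
    w1 = torch.randn(1000, 100, device=device)
    y_pred = torch.matmul(x, w1)
    shape = tuple(y_pred.shape)

    # Drop all references so the tensors can be freed now rather than
    # whenever the interpreter tears down its remaining objects.
    del x, w1, y_pred
    gc.collect()

    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for outstanding kernels
        torch.cuda.empty_cache()   # return cached blocks to the driver

    return shape


if __name__ == "__main__":
    print("result shape:", run_and_release())
```

Whether this changes what cuda-memcheck reports is not something the thread establishes; it only separates work done while the program is running from work done at exit.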

ptrblck · June 20, 2020, 7:37am

I don’t think so, as I use cuda-memcheck regularly and the linked comment is fairly old by now.

You could try a Docker container and check whether cuda-memcheck works in that environment.
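
When comparing the cluster installation with a container, it can help to print the same environment summary in both places and diff the results. The sketch below is an illustration, not something posted in the thread: `collect_env_info` is a hypothetical helper name, and it relies only on `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.get_device_name()`. The import is guarded so the script still reports something where PyTorch is not installed.

```python
import platform
import sys

try:
    import torch
except ImportError:  # still report the Python side if PyTorch is missing
    torch = None


def collect_env_info():
    """Gather version details worth comparing between two environments."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    if torch is None:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    info["torch_cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

Running it once on the host and once inside the container gives two short listings that can be compared line by line.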