Hello! I have a NN that is trained to predict the output of an equation. It is a very basic fully-connected NN. I used it for weeks perfectly fine, but for some reason today I started getting
CUDA error: out of memory. I reduced the batch size to 1 (it was initially working just fine with a batch size of 1024) and I am still getting this error. I am running this on a shared Linux machine (though I am the only user most of the time). I logged in and out several times and I am still getting the error. I ran
nvidia-smi and noticed that GPU utilization was at 99-100%; later that value dropped to 0%, but I still get the error. I am really new to AI and I don't know anything about GPUs. Does anyone have any idea what the problem could be? Thank you!
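For what it's worth, this symptom on a shared machine often means some other (possibly stale) process is still holding the GPU memory. Besides nvidia-smi, the free/total memory on the device can be checked from Python directly. A minimal sketch, assuming PyTorch ≥ 1.10 for `torch.cuda.mem_get_info` (the helper name is mine):

```python
import torch

def report_gpu_memory():
    """Print free/total memory on the current CUDA device, if one is visible.

    Returns True if a GPU was found, False otherwise. If 'free' is tiny here
    while your own script has allocated almost nothing, some other process
    owns the memory.
    """
    if not torch.cuda.is_available():
        print("no CUDA device visible to this process")
        return False
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"free: {free / 1024**2:.0f} MiB / total: {total / 1024**2:.0f} MiB")
    return True
```

If nvidia-smi lists a process you don't recognize holding memory, killing that process (or asking the machine's admin to) usually clears the error.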
Update: Here is a part of the code:
for filename in os.listdir(pathdir):
    print(filename)
    for i in range(1):
        # number of input variables = number of columns minus the output column
        n_variables = np.loadtxt(pathdir+"/%s" %filename, dtype='str').shape[1]-1
        variables = np.loadtxt(pathdir+"/%s" %filename, usecols=(0,))

        if n_variables==0:
            print("Solved! ", variables)
            continue
        elif n_variables==1:
            variables = np.reshape(variables,(len(variables),1))
        else:
            for j in range(1,n_variables):
                v = np.loadtxt(pathdir+"/%s" %filename, usecols=(j,))
                variables = np.column_stack((variables,v))

        f_dependent = np.loadtxt(pathdir+"/%s" %filename, usecols=(n_variables,))
        f_dependent = np.reshape(f_dependent,(len(f_dependent),1))

        factors = torch.from_numpy(variables[0:100000])
        print(len(factors))
        factors = factors.cuda()
The error seems to appear in the last line, at
factors = factors.cuda(). factors is a torch tensor of size 4 x 100000. I made it size 4 x 1 and I am still running out of CUDA memory.
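A 4 x 1 tensor needs only a few bytes, so if even that fails, the memory is almost certainly exhausted before this line runs. One way to tell whether my own process or something else is responsible is to print what PyTorch's caching allocator holds at the moment of the failure. A sketch of that idea (the helper name and the CPU fallback are mine, not part of the original code):

```python
import torch

def to_cuda_safely(t):
    """Move a tensor to the GPU; on an OOM error, report what this process's
    caching allocator holds before re-raising. Falls back to CPU when no
    CUDA device is available."""
    if not torch.cuda.is_available():
        return t  # nothing to move on a CPU-only setup
    try:
        return t.cuda()
    except RuntimeError as e:
        if "out of memory" in str(e):
            # Memory owned by *this* process via PyTorch. If these numbers are
            # near zero while nvidia-smi shows the card full, another
            # (possibly stale) process owns the memory.
            print("allocated:", torch.cuda.memory_allocated() // 1024**2, "MiB")
            print("reserved: ", torch.cuda.memory_reserved() // 1024**2, "MiB")
        raise

factors = to_cuda_safely(torch.ones(4, 1))
```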