CUDA out of memory. Tried to allocate 17179869176.57 GiB (GPU 0; 15.90 GiB total capacity; 8.57 GiB already allocated; 6.67 GiB free; 8.58 GiB reserved in total by PyTorch)
I am working with a text dataset of 50 to 60 data points. Each sequence has about 200K tokens on average, and the longest sequence has about 500K tokens. GPU memory is about 16 GB, so it's throwing a memory error. Any suggestions on how to circumvent this issue?
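A common workaround for sequences this long with an LSTM is truncated backpropagation through time: split each sequence into chunks, carry the hidden state across chunks, and detach it between them so the graph (and the activations it holds on the GPU) from earlier chunks can be freed. A minimal sketch, assuming a plain nn.LSTM; the dimensions and `chunk_len` below are illustrative, not taken from your setup:

```python
import torch
import torch.nn as nn

# Illustrative sizes -- a real 200K-token sequence would use the same pattern.
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
opt = torch.optim.Adam(lstm.parameters())
criterion = nn.MSELoss()

seq = torch.randn(1, 2000, 32)      # stand-in for one very long sequence
target = torch.randn(1, 2000, 64)
chunk_len = 500                     # backprop only through this many steps

hidden = None
for start in range(0, seq.size(1), chunk_len):
    x = seq[:, start:start + chunk_len]
    y = target[:, start:start + chunk_len]
    out, hidden = lstm(x, hidden)
    # Detach so the graph from earlier chunks is released instead of growing
    hidden = tuple(h.detach() for h in hidden)
    loss = criterion(out, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Peak memory then scales with `chunk_len` rather than the full sequence length, at the cost of gradients not flowing across chunk boundaries.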
Thanks for the information.
It seems the nn.LSTM might use too much memory for your device, as it would need ~9 GB.
However, the error message itself is misleading: the reported allocation size (~17 billion GiB) is clearly wrong. We've recently fixed a similar issue for convolutions, so I'll check it with the latest master build.
def train(model, loss, opt, num_epochs, train_dl, valid_dl):
    model = model.cuda()
    for i in tqdm(range(num_epochs)):
        print('*' * 5 + f'Training epoch {i+1}' + '*' * 5)
        model.train()  # set mode once per epoch, not once per batch
        for x, y in train_dl:
            # Variable is deprecated; tensors require_grad directly now
            x = x.cuda()
            y = torch.FloatTensor(y).cuda()
            pred = model(x)
            error = loss(pred, y)
            opt.zero_grad()
            error.backward()
            opt.step()
        model.eval()
        for x1, y1 in valid_dl:
            with torch.no_grad():
                x1 = x1.cuda()
                y1 = y1.cuda()
                pred1 = model(x1)
                error1 = loss(pred1, y1)
                preds = list(map(prediction, pred1))
                accuracy = accuracy_score(y1.cpu(), preds)
                # .item() avoids keeping a CUDA tensor alive just for logging
                print(f'Valid loss: {error1.item()}\n Accuracy Score: {accuracy}')
        # state_dict must be called -- saving the bound method stores no weights
        torch.save(model.state_dict(), f'model{i+1}.pth')
This is my training loop.
I am getting a “CUDA out of memory. Tried to allocate 4.43 GiB (GPU 0; 15.90 GiB total capacity; 6.42 GiB already allocated; 4.42 GiB free; 10.83 GiB reserved in total by PyTorch) (malloc at /opt/conda/conda-bld/pytorch_1587428398394/work/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f14c4533b5e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)” Error after running about 7 epochs.
Any insights would be appreciated…