Memory error while training a variable sequence length LSTM

CUDA out of memory. Tried to allocate 17179869176.57 GiB (GPU 0; 15.90 GiB total capacity; 8.57 GiB already allocated; 6.67 GiB free; 8.58 GiB reserved in total by PyTorch)

I am working with a text dataset of 50 to 60 data points. Each sequence has about 200K tokens on average, and the longest sequence has about 500K tokens. The GPU has about 16 GB of memory, hence the memory error. Any suggestions on how to circumvent this issue?

A sequence of ~500K tokens is quite large, so it won’t fit on the device :wink:
Could you post the complete shape of your inputs?

If this memory footprint is expected and not caused by a bug, then you would have to reduce the input shape and the model significantly.
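For illustration, here is a minimal sketch (not from the thread) of one way to shrink the input: splitting each very long token sequence into fixed-size chunks before batching. MAX_LEN and chunk_sequence are hypothetical names.

MAX_LEN = 4096  # hypothetical cap chosen to fit the GPU

def chunk_sequence(token_ids, max_len=MAX_LEN):
    # Split one long list of token ids into fixed-size chunks,
    # which can then be processed one at a time (e.g. with truncated BPTT).
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# A 500K-token document becomes ~123 chunks of 4096 tokens each.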

Yeah :sweat_smile:.

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, vocab_len, input_size):
        super(Model, self).__init__()
        self.embed = nn.Embedding(num_embeddings=vocab_len, embedding_dim=128)
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=1024, num_layers=1)
        self.fc = nn.Linear(in_features=1024, out_features=1)

    def forward(self, x):
        x = self.embed(x)
        o, (ht, ct) = self.lstm(x)
        out = self.fc(ct)
        return torch.sigmoid(out)  # nn.Sigmoid is a module class; apply torch.sigmoid to the tensor instead

This is the model.
The input shape is (2, 560874) [batch_size, seq_len].
Am I making a blunder?

The input tensor itself will just use ~4.2MB and the linear layer is also quite small.
How large is vocab_len, and is input_size=560874?
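As a rough sanity check of the ~4.2MB figure (a back-of-the-envelope sketch, assuming 4-byte elements):

batch_size, seq_len = 2, 560874
input_bytes = batch_size * seq_len * 4       # 4 bytes per element
print(input_bytes / 1024**2)                 # ≈ 4.28 MB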

Yes, the vocab_len is 513.

Thanks for the information.
It seems the nn.LSTM might use too much memory for your device, as it would need ~9GB.
However, the error message itself is wrong and reports a bogus allocation size. We’ve recently fixed a similar issue for convolutions, so I’ll check this one against the latest master build.
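A rough estimate of where the ~9GB comes from (a sketch, assuming float32 weights): with input_size=560874, the input-to-hidden weight matrix of the LSTM alone holds 4 * hidden_size * input_size parameters.

input_size, hidden_size = 560874, 1024
w_ih_params = 4 * hidden_size * input_size   # ≈ 2.3e9 parameters for weight_ih_l0
print(w_ih_params * 4 / 1024**3)             # ≈ 8.56 GiB in float32, before gradients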


I figured it out: the text preprocessing was wrong. Moving on to my next question.

import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, emb_sz, hidden_sz, num_c):
        super(Model, self).__init__()
        self.embed = nn.Embedding(num_embeddings=len(TEXT.vocab), embedding_dim=emb_sz)
        self.lstm = nn.LSTM(input_size=emb_sz, hidden_size=hidden_sz, num_layers=1)
        self.fc = nn.Linear(in_features=hidden_sz, out_features=num_c)

    def forward(self, x):
        x = x.permute(1, 0)         # [batch, seq_len] -> [seq_len, batch]
        x = self.embed(x)           # [seq_len, batch, emb_sz]
        o, _ = self.lstm(x)         # [seq_len, batch, hidden_sz]
        o = o[-1]                   # last time step; o[:, -1, :] would pick the last batch element instead
        out = self.fc(o)
        return torch.sigmoid(out)   # F.sigmoid is deprecated

This is my model.
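A quick shape check of that forward pass with standalone layers (hypothetical sizes; this stands in for the Model class, which needs TEXT.vocab from torchtext):

import torch
import torch.nn as nn

vocab_size, emb_sz, hidden_sz, num_c = 513, 128, 1024, 1
embed = nn.Embedding(vocab_size, emb_sz)
lstm = nn.LSTM(input_size=emb_sz, hidden_size=hidden_sz, num_layers=1)
fc = nn.Linear(hidden_sz, num_c)

x = torch.randint(0, vocab_size, (2, 1000))   # [batch_size, seq_len]
x = x.permute(1, 0)                           # [seq_len, batch_size]
o, _ = lstm(embed(x))                         # [seq_len, batch_size, hidden_sz]
out = torch.sigmoid(fc(o[-1]))                # last time step -> [batch_size, num_c]
print(out.shape)                              # torch.Size([2, 1])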

import torch
from tqdm import tqdm
from sklearn.metrics import accuracy_score   # assumed source of accuracy_score

def train(model, loss, opt, num_epochs, train_dl, valid_dl):
    model = model.cuda()
    for i in tqdm(range(num_epochs)):
        print('*' * 5 + f'Training epoch {i+1}' + '*' * 5)
        model.train()
        for (x, y) in train_dl:
            x = x.cuda()                              # Variable is deprecated; plain tensors work
            y = torch.FloatTensor(y).cuda()
            pred = model(x)
            error = loss(pred, y)
            opt.zero_grad()
            error.backward()
            opt.step()
        model.eval()
        for (x1, y1) in valid_dl:
            with torch.no_grad():
                x1 = x1.cuda()
                y1 = y1.cuda()
                pred1 = model(x1)
                error1 = loss(pred1, y1)
                preds = list(map(prediction, pred1))  # `prediction` is a user-defined helper
                accuracy = accuracy_score(y1.cpu(), preds)
                print(f'Valid loss: {error1}\n Accuracy Score: {accuracy}')
        torch.save(model.state_dict(), f'model{i+1}.pth')  # state_dict() must be called, not passed as a method

This is my training loop.
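For context, the loop might be invoked like this (a hypothetical setup; the loss function, optimizer, hidden size, and data iterators are assumptions, not stated in the thread):

import torch.nn as nn
import torch.optim as optim

model = Model(emb_sz=128, hidden_sz=1024, num_c=1)
loss_fn = nn.BCELoss()                          # matches the sigmoid output
opt = optim.Adam(model.parameters(), lr=1e-3)
train(model, loss_fn, opt, num_epochs=10, train_dl=train_dl, valid_dl=valid_dl)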

I am getting the following error after running about 7 epochs:

CUDA out of memory. Tried to allocate 4.43 GiB (GPU 0; 15.90 GiB total capacity; 6.42 GiB already allocated; 4.42 GiB free; 10.83 GiB reserved in total by PyTorch) (malloc at /opt/conda/conda-bld/pytorch_1587428398394/work/c10/cuda/CUDACachingAllocator.cpp:289)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f14c4533b5e in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
Any insights would be appreciated…
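Since the allocation only fails after several epochs, one way to narrow it down (a hedged suggestion, not an answer from the thread) is to log the allocator statistics after each epoch and see whether usage grows steadily:

import torch

def log_gpu_memory(epoch):
    # Report how much memory is currently allocated vs. reserved by the caching allocator.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f'epoch {epoch}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB')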