I’m getting a weird OOM issue when training my model on a GPU. The model has very few parameters: just an embedding layer (about 20000 x 300) and a matrix parameter (300 x 20000). In theory it should only consume several hundred MB of memory and fit easily on the GPU, but during training the GPU memory consumption skyrockets to over 10 GB after running for just a few minutes.
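As a rough sanity check on the parameter footprint (assuming float32, i.e. 4 bytes per element, and the dimensions from above):

```python
# Back-of-the-envelope parameter memory, assuming float32 (4 bytes/element).
vocab_size, embed_size = 20000, 300

embedding_bytes = vocab_size * embed_size * 4  # 20000 x 300 embedding table
matrix_bytes = embed_size * vocab_size * 4     # 300 x 20000 output matrix

total_mb = (embedding_bytes + matrix_bytes) / 1024 ** 2
print(f"raw parameters: ~{total_mb:.0f} MB")   # ~46 MB
# Even counting gradients plus two optimizer moment buffers (roughly 4x,
# e.g. with Adam), that is still only a few hundred MB -- nowhere near 10 GB.
```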
Here is the code for my model:
```python
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, hidden_size, embedding, layer_num=1):
        super(Model, self).__init__()
        self.layer_num = layer_num
        self.hidden_size = hidden_size
        self.embedding = embedding
        self.vocab_size = embedding.vocab_size
        self.embed_size = embedding.embedding_size
        self.param = nn.Linear(self.hidden_size, self.vocab_size)

    def forward(self, inputs, lengths):
        emb = self.embedding.embedding(inputs)  # (L, B, embed_size)
        out = self.param(emb)                   # (L, B, vocab_size)
        return out
```
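The only large intermediate I can think of is the forward output itself, which is (L, B, vocab_size). As a quick check with hypothetical sizes (my real batches vary, but say 64 sentences of length 50):

```python
# Size of a single forward output (L, B, vocab_size), assuming float32.
# The sequence length 50 and batch size 64 here are just placeholders.
L, B, vocab_size = 50, 64, 20000
out_mb = L * B * vocab_size * 4 / 1024 ** 2
print(f"one batch of logits: ~{out_mb:.0f} MB")  # ~244 MB
```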
And here is my training loop:
```python
for epoch in range(epoches):
    for idx, batch in enumerate(next_batch(lines, BATCH_SIZE)):
        optimizer.zero_grad()
        pad_sents, lengths, pad_labels, mask, _, _ = batch2train(emb, batch)
        out = model(pad_sents, lengths)                # (L, B, vocab_size)
        ret = out.view(out.size(0) * out.size(1), -1)  # flatten to (L*B, vocab_size)
        labels = pad_labels.view(-1)                   # flatten to (L*B,)
        loss = criterion(ret, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
        optimizer.step()
```
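To watch where the growth happens, I can log what the CUDA caching allocator reports at the end of each iteration (a small helper I sketched; note that `memory_reserved` was called `memory_cached` in older PyTorch versions):

```python
import torch

def log_gpu_memory(step, device=0):
    # memory_allocated: bytes currently held by live tensors;
    # memory_reserved: total bytes the caching allocator has grabbed from the GPU.
    alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 2
    print(f"step {step}: allocated={alloc:.1f} MB, reserved={reserved:.1f} MB")
```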
I tried to print all my tensor objects using the following code:
```python
import gc
import torch

def get_tensors():
    for obj in gc.get_objects():
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            yield obj
```
It seems like there are not many tensor objects alive during training, yet the overall memory consumption is abnormally high. Any idea what might be causing this, and how can I print out all objects residing in GPU memory?
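For reference, what I have so far only lists the tensors; to also total their GPU footprint I sketched the following (it only sees tensors still reachable from Python, so blocks cached by the allocator or buffers saved internally by autograd won't appear):

```python
import gc
import torch

def report_gpu_tensors():
    total = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                size = obj.element_size() * obj.nelement()
                total += size
                print(type(obj), tuple(obj.size()), f"{size / 1024 ** 2:.1f} MB")
        except Exception:
            pass  # some gc-tracked objects raise on attribute access
    print(f"total tensor memory visible to gc: {total / 1024 ** 2:.1f} MB")
```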