I’ve got a model with 110M parameters, and I’m training it on a very small dataset (like 500 examples).

Yet that is enough to crash PyTorch with an out-of-memory error on a K80 GPU with 11 GB of VRAM.
What is going on here? 110M × 4 bytes (float32) = 440 MB, plus a minuscule dataset, which is nowhere near 11 GB of VRAM…

I would be very surprised if your model only took up the raw size of its floats. Especially while it is training, it has to keep track of gradients, and there are probably other things (optimizer state, intermediate activations) that need to be in memory for training to work.
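As a rough back-of-the-envelope (assuming float32 weights and plain Adam, which keeps two extra moment buffers per parameter), the fixed cost is already about 4× the raw weight size:

```python
# Rough training-memory estimate for a 110M-parameter float32 model.
# Adam stores two moment buffers (exp_avg, exp_avg_sq) per parameter,
# so before any activations we already need ~4x the raw weight size.
n_params = 110_000_000
bytes_per_float = 4

weights = n_params * bytes_per_float        # model parameters
grads = n_params * bytes_per_float          # one gradient per parameter
adam_state = 2 * n_params * bytes_per_float # exp_avg + exp_avg_sq

total_gb = (weights + grads + adam_state) / 1e9
print(f"{total_gb:.2f} GB before activations")  # 1.76 GB
```

Still far from 11 GB, which is why the activations (covered below in the thread) matter so much.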

See if you still get the out-of-memory error when you train the network on a batch with just one input. If that works, try increasing the number of inputs per batch until you find the breaking point.
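If trying batch sizes one by one is tedious, you can binary-search for the breaking point instead. A minimal sketch, where `fits(batch_size)` is a hypothetical callable you would supply that runs one training step and returns False on an out-of-memory error:

```python
def largest_batch_that_fits(fits, lo=1, hi=1024):
    """Binary-search the largest batch size for which fits(bs) is True.

    Assumes monotonicity: if one batch size runs out of memory,
    every larger batch size does too. Returns 0 if even `lo` fails.
    """
    if not fits(lo):
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the search terminates
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Example with a fake memory model where anything above 10 "OOMs":
print(largest_batch_that_fits(lambda bs: bs <= 10))  # 10
```

In a real run, `fits` would wrap a forward/backward pass in a try/except for the CUDA out-of-memory error and clear the cache between attempts.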

Another thing: perhaps the GPU memory is being used by other processes, so that not all of the 11 GB is available? The output of nvidia-smi should tell you whether this is the case.

Do you think those other things could increase the memory footprint by roughly 20×?

Regarding your suggestions:
I have tried increasing the batch size slowly; it works with very small batch sizes, around 10 I think.
And yes, I've checked with nvidia-smi; I'm the only one using this machine and there are no zombie processes.

Extra context:
I also noticed a possible memory leak in my code: the mini-batch tensor is reallocated at each update step, and it may not be garbage-collected quickly enough. Still, I tried fixing this with a reusable mini-batch tensor, and it still didn't work…

Each element is 200 floats, so 800 bytes.
Here is the brief version of my code (with only minimal memory-leak patches):

```python
# pad_len := max number of tokens (input sequences padded to this length)
def train_loop(model, input_output_data, epochs=5, batch_size=32, pad_len=200):
    inputs_raw, outputs_raw = input_output_data
    n_examples = len(inputs_raw)
    model.train()  # turn on training mode
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    n_batches = (n_examples + batch_size - 1) // batch_size  # ceiling division
    for i in range(epochs):
        print(f'epoch: {i+1}/{epochs}')
        for j in range(n_batches):
            optimizer.zero_grad()
            torch.cuda.empty_cache()
            try:
                # they must be padded to the same size for batching to work...
                inputs = tokenizer(inputs_raw[j*batch_size:(j+1)*batch_size], return_tensors='pt',
                                   padding='max_length', truncation=True, max_length=pad_len)
                outputs = tokenizer(outputs_raw[j*batch_size:(j+1)*batch_size], return_tensors='pt',
                                    padding='max_length', truncation=True, max_length=pad_len)
                # mean is unnecessary here but just for safety
                loss = model(**inputs, labels=outputs['input_ids'])[0].mean()
            except Exception:
                #pdb.set_trace()
                raise
            print(f'batch: {j+1}/{n_batches}, loss: {loss.item()}')
            loss.backward()
            optimizer.step()
```

The memory usage is model-dependent, and often the majority of the memory is used by the forward activations, not the parameters or gradients.
E.g. in this post I've posted some stats about the used model, and you can see that the activations take up the majority of the memory.

It really depends on the model architecture; especially for e.g. conv layers you would see a huge gap between parameter size and activation size, while linear layers could yield the inverse effect.
Here is a smaller example:

```python
import torch
import torch.nn as nn

# conv: few parameters, large activation
model = nn.Conv2d(3, 64, 3, 1, 1)
x = torch.randn(1, 3, 224, 224)
out = model(x)
model_param_size = sum(p.nelement() for p in model.parameters())
input_size = x.nelement()
act_size = out.nelement()
print('model size: {}\ninput size: {}\nactivation size: {}'.format(
    model_param_size, input_size, act_size))
```
> model size: 1792
> input size: 150528
> activation size: 3211264

```python
# linear: many parameters, small activation
model = nn.Linear(1024, 1024)
x = torch.randn(1, 1024)
out = model(x)
model_param_size = sum(p.nelement() for p in model.parameters())
input_size = x.nelement()
act_size = out.nelement()
print('model size: {}\ninput size: {}\nactivation size: {}'.format(
    model_param_size, input_size, act_size))
```
> model size: 1049600
> input size: 1024
> activation size: 1024