PyTorch appears to be crashing due to OOM prematurely?

I’ve got a model with 110M parameters, and I’m training it on a very small dataset (like 500 examples).

Yet that is enough to crash PyTorch on a K80 GPU with 11GB VRAM.
What is going on here? 110M x 4 bytes (float32 size) = 440 MB = 0.44 GB + minuscule dataset size != 11GB VRAM…

Thanks for your help

I would be very surprised if the parameters alone accounted for all the memory your model uses. During training especially, it also has to keep a gradient for every parameter. And there are probably lots of other things which need to be in memory for the model to work.
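To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes float32 everywhere and an Adam-style optimizer that keeps two running averages per parameter; the function name is made up for illustration, and activations are deliberately left out because they depend on the architecture and the batch size.

```python
def static_training_bytes(n_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Rough fixed memory cost of training, ignoring activations (hypothetical helper)."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value  # one gradient per parameter
    # Assumption: Adam-style optimizer with two extra values per parameter
    optimizer = n_params * bytes_per_value * optimizer_states_per_param
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = static_training_bytes(110_000_000)
for name, n_bytes in estimate.items():
    print(f"{name}: {n_bytes / 1e9:.2f} GB")
# weights: 0.44 GB
# gradients: 0.44 GB
# optimizer: 0.88 GB
# total: 1.76 GB
```

Under these assumptions the fixed cost is about 4x the parameter size, which is still far below 11GB, so whatever fills the rest has to be something that grows with the batch.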

See if you get the out of memory error if you train the network on a batch with just one input. If that works, try increasing the number of inputs in a batch till you figure out the breaking point.
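The search for the breaking point could be automated roughly like this. It is only a sketch: `try_step` is a hypothetical callable that you would write yourself to run one forward/backward pass at the given batch size, and the check assumes the out-of-memory error surfaces as a RuntimeError whose message contains "out of memory".

```python
def find_largest_working_batch(try_step, start=1, limit=1024):
    """Double the batch size until try_step hits an OOM error.

    Returns the largest batch size that worked, or None if even `start` failed.
    """
    largest_ok = None
    batch_size = start
    while batch_size <= limit:
        try:
            try_step(batch_size)
        except RuntimeError as err:
            # Assumption: OOM shows up as a RuntimeError mentioning "out of memory"
            if "out of memory" not in str(err).lower():
                raise  # some other failure: do not hide it
            break
        largest_ok = batch_size
        batch_size *= 2
    return largest_ok


# Stand-in for a real training step: pretends anything above 10 examples fails
def fake_step(batch_size):
    if batch_size > 10:
        raise RuntimeError("CUDA out of memory (simulated)")


print(find_largest_working_batch(fake_step))  # 8
```

Doubling only brackets the limit (here somewhere between 8 and 16); a finer search between the last working and first failing size would pin it down.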

Another thing: perhaps the GPU memory is being used by other processes, so that not all of the 11GB is available? The output of nvidia-smi should tell you if this is the case or not.
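The same check can be done from inside Python, as a complement to nvidia-smi. This is a minimal sketch assuming a reasonably recent PyTorch that provides `torch.cuda.mem_get_info`; the helper name and the dictionary keys are made up for illustration, and it returns None when PyTorch or a CUDA device is not available.

```python
def gpu_memory_report(device=0):
    """Return a dict of GPU memory figures in GB, or None without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Free/total memory on the device, counting all processes using it
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        "total_gb": total_bytes / 1e9,
        "free_gb": free_bytes / 1e9,
        # Memory held by tensors in this process
        "allocated_by_this_process_gb": torch.cuda.memory_allocated(device) / 1e9,
        # Memory held by PyTorch's caching allocator in this process
        "reserved_by_this_process_gb": torch.cuda.memory_reserved(device) / 1e9,
    }


print(gpu_memory_report())
```

If free memory is much smaller than total before your own process has allocated anything, something else is occupying the GPU.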

You think those other things could increase the memory burden roughly x20?

Regarding your suggestions:
I have tried increasing the batch size slowly; it works with very small batch sizes, around 10 I think.
And yes I’ve checked with nvidia-smi, I’m the only one using this machine and there are no zombie processes.

Extra Context:
Also, I noticed there could be a memory leak in my code, in the sense that the mini-batch tensor is reallocated at each update step and may not be garbage collected quickly enough. I tried fixing this with a reusable mini-batch tensor, but it still didn’t work…

How big is each element of the input data set? Perhaps it is the data loader which is taking up lots of memory?

Each element is 200 floats so 800 bytes.
Here is the brief version of my code (with only minimal memory leak patches):

import math

import torch
from torch import optim

# `tokenizer` is defined elsewhere (not shown in this brief version)
# pad_len := max number of tokens (input sequences padded to this length)
def train_loop(model, input_output_data, epochs=5, batch_size=32, pad_len=200):
    input_texts, output_texts = input_output_data
    n_examples = len(input_texts)

    model.train() # turn on training
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    n_batches = math.ceil(n_examples / batch_size)
    for i in range(epochs):
        print(f'epoch: {i+1}/{epochs}')
        for j in range(n_batches):
            optimizer.zero_grad()
            torch.cuda.empty_cache()
            try:
                # they must be padded to the same size for batching to work...
                inputs = tokenizer(input_texts[j*batch_size:(j+1)*batch_size], return_tensors='pt',
                                   padding='max_length', truncation=True, max_length=pad_len)
                outputs = tokenizer(output_texts[j*batch_size:(j+1)*batch_size], return_tensors='pt',
                                    padding='max_length', truncation=True, max_length=pad_len)
                # mean is unnecessary but just for safety
                loss = model(**inputs, labels=outputs['input_ids'])[0].mean()
            except Exception:
                #pdb.set_trace()
                raise

            # .item() prints the plain number rather than the tensor
            print(f'batch: {j+1}/{n_batches}, loss: {loss.item()}')
            loss.backward()
            optimizer.step()

I can’t see anything in this code that suggests large memory use. As a band-aid, you could try adding

del inputs
del outputs
del loss

after the optimizer.step() line.

The memory usage is model-dependent, and often the majority of the memory is used by the forward activations, not the parameters or gradients.
For example, in this post I shared some stats for one model, and you can see that the activations use:

6452341200 / 138357544 ~= 47

times more memory than the parameters.

Wow that’s astonishing. I heard it was more but I had no idea it was this much more.

It really depends on the model architecture. Conv layers in particular show a huge difference, because a small kernel is reused at every spatial position, while linear layers can show the opposite effect.
Here is a smaller example:

import torch
import torch.nn as nn

# conv
model = nn.Conv2d(3, 64, 3, 1, 1)
x = torch.randn(1, 3, 224, 224)

out = model(x)

model_param_size = sum([p.nelement() for p in model.parameters()])
input_size = x.nelement()
act_size = out.nelement()

print('model size: {}\ninput size: {}\nactivation size: {}'.format(
    model_param_size, input_size, act_size))

> model size: 1792
  input size: 150528
  activation size: 3211264
  
# linear
model = nn.Linear(1024, 1024)
x = torch.randn(1, 1024)

out = model(x)

model_param_size = sum([p.nelement() for p in model.parameters()])
input_size = x.nelement()
act_size = out.nelement()

print('model size: {}\ninput size: {}\nactivation size: {}'.format(
    model_param_size, input_size, act_size))

> model size: 1049600
  input size: 1024
  activation size: 1024
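The printed counts can also be checked by hand. The sketch below uses the standard output-size formula for a convolution; the helper names are made up for illustration, and "activations" here means only the layer's output elements, as in the example above.

```python
def conv2d_counts(c_in, c_out, k, h, w, batch=1, stride=1, padding=1):
    """Parameter and output-element counts for a square-kernel Conv2d with bias."""
    params = c_out * c_in * k * k + c_out  # weights + biases
    h_out = (h + 2 * padding - k) // stride + 1
    w_out = (w + 2 * padding - k) // stride + 1
    activations = batch * c_out * h_out * w_out
    return params, activations


def linear_counts(n_in, n_out, batch=1):
    """Parameter and output-element counts for a Linear layer with bias."""
    params = n_in * n_out + n_out  # weights + biases
    activations = batch * n_out
    return params, activations


print(conv2d_counts(3, 64, 3, 224, 224))            # (1792, 3211264)
print(linear_counts(1024, 1024))                    # (1049600, 1024)
print(conv2d_counts(3, 64, 3, 224, 224, batch=32))  # (1792, 102760448)
```

The last line shows why batch size matters so much: the parameter count stays fixed while the activation count scales linearly with the batch.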

Hmm, that’s a good point, so it comes down to weight sharing… Does my transformer have weight sharing (I’m using XLNet)? I think yes, because of recurrence, right?

Also, I guess the remedy is pipeline parallelism, right?