RuntimeError: reduce failed to get memory buffer: out of memory - After 30,000 iterations

I am training a model that uses about 10GB of memory. My GPU has 11GB of RAM. Training seems to progress fine for about 2 epochs (30,000 iterations), then I suddenly get this error. Why does it say that I do not have enough memory? Could something in the code be slowly increasing RAM usage?

I just read about pin_memory and found out that I have it set to True in my DataLoader. Could this be the most likely reason I randomly get OOM errors?

Traceback (most recent call last):
  File "/usr/lib/python3.6/", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/", line 85, in _run_code
    exec(code, run_globals)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/", line 189, in <module>
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/", line 185, in main
    trainer.loop(train_loader, val_loader, args.epochs, start_epoch=start_epoch)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/", line 91, in loop
    self.train(train_scenes, epoch)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/", line 190, in train
    loss, head_losses = self.train_batch(data1, target1, meta1, data2, target2, meta2, apply_gradients)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/", line 108, in train_batch
    loss1, head_losses1 = self.loss(outputs1, targets1, head="pifpaf")
  File "/home/haziq/env_openpifpaf/lib/python3.6/site-packages/torch/nn/modules/", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/", line 101, in forward
    for l, f, t in zip(self.losses_pifpaf, head_fields, head_targets)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/", line 102, in <listcomp>
    for ll in l(f, t)]
  File "/home/haziq/env_openpifpaf/lib/python3.6/site-packages/torch/nn/modules/", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/", line 279, in forward
    for x_scale, scale_to_kp in zip(x_scales, self.scales_to_kp)
  File "/home/haziq/openpifpaf_crm_pose/openpifpaf/network/", line 279, in <listcomp>
    for x_scale, scale_to_kp in zip(x_scales, self.scales_to_kp)
  File "/home/haziq/env_openpifpaf/lib/python3.6/site-packages/torch/nn/", line 2231, in l1_loss
    ret = torch._C._nn.l1_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: reduce failed to get memory buffer: out of memory

Are you storing some tensors that are still attached to the computation graph?
If so, each stored tensor keeps its autograd history alive, so memory usage grows with every tensor you store.
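A common way this happens (the PyTorch FAQ uses `total_loss += loss` as its example) is keeping the loss tensor itself across iterations. A minimal sketch, assuming a generic model, optimizer, and an iterable of `(x, y)` batches; `train_epoch` is a hypothetical helper, not part of the openpifpaf trainer:

```python
import torch
import torch.nn as nn


def train_epoch(model, optimizer, batches):
    """Train over `batches` and return the per-iteration losses as Python floats."""
    history = []
    for x, y in batches:
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # history.append(loss) would keep every loss tensor, and the grad_fn
        # chain attached to it, alive for as long as `history` exists.
        # loss.item() stores only the number, so nothing refers back to the graph.
        history.append(loss.item())
    return history
```

The same applies to outputs kept around for metrics: storing `output.detach().cpu()` instead of the raw output avoids holding on to the graph and to GPU memory.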

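Also, you could (possibly) reduce the memory footprint by using functions for training and evaluation, since Python uses function scoping: local variables such as the output and loss tensors are released when the function returns, whereas variables at module scope stay alive until they are overwritten.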
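A sketch of that structure, assuming a generic model and MSE loss (the names `train_step` and `validate` are hypothetical). Because `out` and `loss` are locals, they become unreachable when `train_step` returns instead of lingering while validation runs; `torch.no_grad()` additionally stops validation from building a graph at all:

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, x, y):
    # `out` and `loss` only live inside this call; once it returns, the
    # activations saved for backward have no remaining references.
    out = model(x)
    loss = nn.functional.mse_loss(out, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def validate(model, batches):
    # No graph is recorded under no_grad, so validation keeps no
    # activations around for a backward pass that never happens.
    total = 0.0
    for x, y in batches:
        total += nn.functional.mse_loss(model(x), y).item()
    return total / len(batches)
```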

Hello, thank you for the suggestions. I will look into it further.

I have also edited my original post to mention that I have pin_memory set to True. I read in some threads (e.g. "What is the disadvantage of using pin_memory?") that it might result in OOM errors, and I was wondering if this might be the case for me.

Are you training your model on the GPU, and running out of RAM?
I assumed you were not using a GPU, since system RAM is usually much larger than GPU memory.

I am sorry. I am using the GPU to train my model. It has 11GB of memory. I mistakenly wrote RAM when I meant GPU memory. I have edited my original post.

Thanks for the clarification. Usually you would get a "CUDA out of memory" error if the GPU runs out of memory, so I was just wondering.

Do you see increased usage of your GPU memory in nvidia-smi?
I’m not sure whether using pin_memory could cause an OOM error, but just disable it and see if anything changes.
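Disabling it only means changing the DataLoader argument. A minimal sketch with a hypothetical toy dataset standing in for the real ones; note that with `pin_memory=True` the batches are copied into page-locked host RAM, not into GPU memory:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset standing in for the MS COCO / volleyball datasets.
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randn(64, 1))

# pin_memory=False: batches stay in ordinary (pageable) host memory.
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=0, pin_memory=False)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for data, target in loader:
    # Regular host-to-device copy of each batch.
    data, target = data.to(device), target.to(device)
```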

The model ran for about 8 hours before the error occurred. I think I looked at the output of nvidia-smi only twice during that period and did not notice a change; usage was 10873MiB / 11178MiB. Then I came back many hours later and saw the out-of-memory error message.

What kind of model are you using?
Are you dealing with some dynamic sizes, e.g. varying batch shapes?
Also, did you make sure to write functions for the train and validation method?

Are you running other processes on this GPU, e.g. your desktop output, or are you using this GPU just for computation?

Thank you for your continued help. I am using a ResNet50 with several heads built on top of it for multi-task learning. It is actually a modification of the openpifpaf code.

Also, did you make sure to write functions for the train and validation method?

Yes, they are in separate functions. Line 89 of

for epoch in range(start_epoch, epochs):
    self.train(train_scenes, epoch)

    self.write_model(epoch + 1, epoch == epochs - 1)
    self.val(val_scenes, epoch + 1)

Are you dealing with some dynamic sizes, e.g. varying batch shapes?

Kind of. I have 2 data loaders which have different batch sizes.

So my version has 2 DataLoaders: one loads the MS COCO dataset and the other the volleyball dataset. It also has 2 heads. At each iteration, I pass the MS COCO data through the ResNet base net and one of the heads, then do a backward pass. Then I pass the volleyball data through the ResNet and the other head. Below is a simple description of my model.

Could this be why I am running out of memory, since the memory the model needs depends on the data I feed through it?

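import torch


class Shell(torch.nn.Module):
    """Shared ResNet base net with two groups of task-specific heads."""

    def __init__(self, base_net, head_nets):
        super(Shell, self).__init__()

        self.base_net    = base_net
        self.head_nets   = torch.nn.ModuleList(head_nets)
        self.head_pifpaf = self.head_nets[:2]   # heads used when head="pifpaf"
        self.head_crm    = self.head_nets[2:]   # heads used when head="crm"

    def io_scales(self):
        return [self.base_net.input_output_scale // (2 ** getattr(h, '_quad', 0))
                for h in self.head_nets]

    def forward(self, x, head):  # pylint: disable=arguments-differ
        x = self.base_net(x)  # shared features for both tasks
        if head == "pifpaf":
            return [hn(x) for hn in self.head_pifpaf]
        if head == "crm":
            return [hn(x) for hn in self.head_crm]
        # Without this, an unknown head name would silently return None.
        raise ValueError("unknown head: {!r}".format(head))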
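A minimal, self-contained sketch of the alternating scheme described above, with small hypothetical stand-ins for the ResNet50 base net and the heads (`TwoHeadNet` and `alternating_step` are illustrative names). It assumes one optimizer step after both backward passes; the real trainer's `apply_gradients` logic is not shown in the thread. Calling `backward()` on the first loss before the second forward pass releases the first graph's saved activations, so the activation peak is roughly that of the larger pass rather than the sum of both:

```python
import torch
import torch.nn as nn


class TwoHeadNet(nn.Module):
    # Hypothetical tiny backbone and heads, not the openpifpaf modules.
    def __init__(self):
        super().__init__()
        self.base_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.head_a = nn.Conv2d(8, 2, 1)  # plays the role of the "pifpaf" heads
        self.head_b = nn.Conv2d(8, 2, 1)  # plays the role of the "crm" heads

    def forward(self, x, head):
        x = self.base_net(x)
        if head == "pifpaf":
            return self.head_a(x)
        if head == "crm":
            return self.head_b(x)
        raise ValueError("unknown head: {!r}".format(head))


def alternating_step(model, optimizer, batch1, batch2,
                     loss_fn=nn.functional.l1_loss):
    """Backprop batch1 through the "pifpaf" head, then batch2 through "crm"."""
    optimizer.zero_grad()
    losses = []
    for (data, target), head in ((batch1, "pifpaf"), (batch2, "crm")):
        loss = loss_fn(model(data, head), target)
        loss.backward()             # frees this graph; gradients accumulate in .grad
        losses.append(loss.item())  # plain float, so no graph outlives the step
    optimizer.step()                # assumption: a single step after both passes
    return losses
```

With different batch sizes per loader, the two passes have different peaks, so memory use alternates from one half of the iteration to the other.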

Thanks for the code!
Could you lower the batch sizes of both DataLoaders and check the memory usage on your GPU?
It would be interesting to see the peak usage, and where it occurs.
Usually it’s not a bad idea to keep some “spare” memory on your GPU so that small fluctuations won’t cause an OOM.
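One way to get those numbers, assuming a reasonably recent PyTorch release (`torch.cuda.reset_peak_memory_stats` is not available in very old versions): wrap each phase in a helper that resets the peak counter and reads `torch.cuda.max_memory_allocated` afterwards. `peak_memory_mib`, `forward_backward`, and the toy model are hypothetical. This counts memory occupied by tensors only, so nvidia-smi will report more, since it also includes memory cached by the allocator and the CUDA context:

```python
import torch
import torch.nn as nn


def peak_memory_mib(fn, device):
    """Run fn() and return the peak memory allocated by tensors during the call, in MiB.

    Returns 0.0 for non-CUDA devices, where there is no CUDA allocator to query."""
    if device.type != "cuda":
        fn()
        return 0.0
    torch.cuda.synchronize(device)
    torch.cuda.reset_peak_memory_stats(device)
    fn()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 1024 ** 2


def forward_backward(model, batch_size, device):
    # Hypothetical stand-in for one training pass at a given batch size.
    x = torch.randn(batch_size, 3, 64, 64, device=device)
    model(x).mean().backward()


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to(device)
    for bs in (4, 8, 16):
        mib = peak_memory_mib(lambda: forward_backward(model, bs, device), device)
        print("batch size {}: peak {:.1f} MiB".format(bs, mib))
```

Measuring the MS COCO and volleyball passes separately in the same way would show which of the two sets the peak.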

Thank you. The problem is that I am using a shared workstation, so other students can run their code and take up any remaining memory. I am worried that even small fluctuations may cause an OOM error.

For example, the GPU shows 18MiB / 11178MiB in use before I run my model. Then it goes up to 10873MiB / 11178MiB when I start. During this time, any other student can run their code as long as it does not exceed the remaining memory. So if the GPU becomes fully utilized, any fluctuation in my code's memory usage will result in an OOM error.

That might be problematic. Since you are using almost the entire GPU, maybe you could ask for a time slot in which you can use this GPU for your training only?

Thank you, yes, I will do that. I was also wondering whether there is a way for PyTorch to allocate all the memory my program needs up front. I was hoping that setting pin_memory to False would do the trick.

As far as I know there is no straightforward and convenient way of allocating all memory.

PyTorch memory usage won’t be constant over time, and the other students’ code might allocate a fixed amount for itself, which in turn might crash your program when it tries to allocate more memory… Try what @ptrblck suggested!