CUDA out of memory during evaluation (tried everything)

Hi,
I’m facing a CUDA OOM error during evaluation, only in the prediction phase (model(inputs)).
The training loop works fine.
A few points:

  1. It happens with and without the training loop.
  2. The train and evaluation loops are in separate functions (each function is called once per epoch, so it does not look like a scope issue).
  3. I am using with torch.no_grad() and model.eval().
  4. The train and eval loaders have the same data types, the same shapes, and the same batch sizes.
  5. This framework works with lots of models. The problem occurs only in a new model I implemented, so I suspect that the problem is in the model and not in the code around it.
  6. The problem occurs with and without mixed precision, but if I use torch.cuda.amp.autocast() before evaluation, the problem is solved. (I still want to know why.)
  7. If I call torch.cuda.empty_cache() inside the forward() method, the problem is apparently solved.
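
To narrow down points 3, 6 and 7, the peak memory of each evaluation batch can be measured directly. Below is a minimal sketch, not the actual framework code: `evaluate`, the toy `Conv1d` model and the plain list used as a loader are placeholders, and the CUDA statistics are only printed when a GPU is used.

```python
import torch

def evaluate(model, loader, device):
    """Run inference batch by batch and report peak GPU memory per batch."""
    model.eval()
    use_cuda = device.type == "cuda"
    outputs = []
    with torch.no_grad():
        for inputs in loader:
            if use_cuda:
                # Start a fresh peak measurement for this batch.
                torch.cuda.reset_peak_memory_stats(device)
            out = model(inputs.to(device))
            # Move results to the CPU so they do not pile up on the GPU.
            outputs.append(out.cpu())
            if use_cuda:
                peak = torch.cuda.max_memory_allocated(device) / 2**30
                reserved = torch.cuda.memory_reserved(device) / 2**30
                print(f"peak allocated: {peak:.2f} GiB, reserved: {reserved:.2f} GiB")
    return torch.cat(outputs, 0)

# Toy usage with placeholder data (tiny shapes, not the real (2048, 10, 256) input).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = torch.nn.Conv1d(10, 4, kernel_size=3, padding=1).to(device)
toy_loader = [torch.randn(2, 10, 16) for _ in range(3)]
predictions = evaluate(toy_model, toy_loader, device)
```

If the reserved figure turns out to be much larger than the peak allocated figure, that would point towards fragmentation in the caching allocator rather than the model genuinely needing that much memory.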

My model has 5 layers of the same kind, and the input grows because of the number of filters in the convolutional layers.
The initial input size is (2048, 10, 256) == (B, C, N), and
the forward function of my model is:

def forward(self, x, info_dict=None):
    self.process_info_dict(info_dict)
    tmp_res = []
    for d, m in self.slice_layers.items():
        if d == "remainder":
            tmp_res.append(m(x[:, self.slice_dict[d]]))
        else:
            b, c, n = (x.shape[0], len(self.slice_dict[d]), x.shape[2])
            tmp_x = x[:, self.slice_dict[d]].view(-1, 1, n)
            res = m(tmp_x)
            res = res.view(b, c * self.n_filters, self.output_dim[-1])
            tmp_res.append(res)
    x = torch.cat(tmp_res, 1)
    self.update_info_dict(info_dict)
    return x

where info_dict is a dictionary of a few hundred integers.
The error occurs at the line
x = torch.cat(tmp_res, 1)
The for loop is executed only 3 times.
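
Since the failure is at torch.cat, it may be worth noting that torch.cat allocates a new output tensor while every entry of tmp_res is still alive, so the peak at that line is roughly twice the size of the concatenated result. One possible workaround (a swapped-in technique, not what the forward() above does) is to preallocate the output once and copy each branch result into it, so that each temporary can be freed right after its copy. A minimal sketch, assuming the channel count of every branch and the output length are known in advance; `cat_preallocated`, `branches` and `channels` are hypothetical names:

```python
import torch

def cat_preallocated(branches, x, channels, out_len):
    """Concatenate branch outputs along dim 1 without keeping them all alive.

    branches: callables mapping x to a (B, c_i, out_len) tensor
    channels: the c_i of each branch, assumed to be known in advance
    """
    out = x.new_empty((x.shape[0], sum(channels), out_len))
    start = 0
    for branch, c in zip(branches, channels):
        # The temporary branch result can be freed as soon as it is copied.
        out[:, start:start + c] = branch(x)
        start += c
    return out

# Toy check on CPU: two small conv branches stand in for the real slice layers.
torch.manual_seed(0)
branches = [torch.nn.Conv1d(10, 4, 3, padding=1), torch.nn.Conv1d(10, 6, 3, padding=1)]
x = torch.randn(2, 10, 16)
with torch.no_grad():
    fused = cat_preallocated(branches, x, channels=[4, 6], out_len=16)
    reference = torch.cat([b(x) for b in branches], 1)
```

This only lowers the peak during inference under torch.no_grad(); whether it is enough depends on how large the biggest single branch output is.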
The model layers are defined here:

def create_layers(self, out_h, dtype_slices, kernel_size, dilation, padding, stride, groups):
    for i, (d, layer_out_h) in enumerate(out_h.items()):
        block_layers = []
        in_channels = dtype_slices[d] if d == "remainder" else 1
        out_channels = layer_out_h if d == "remainder" else int(layer_out_h / dtype_slices[d])
        layer = torch.nn.Conv1d(in_channels, out_channels, kernel_size=kernel_size,
                                dilation=dilation, padding=padding, stride=stride,
                                groups=groups[d])
        block_layers.append(layer)
        if self.activation:
            block_layers.append(self.activation)
        self.output_dim = conv_output_shape(self.output_dim, kernel_size, stride, padding)[0]
        self.slice_layers[d] = torch.nn.Sequential(*block_layers)

and all the layers are stored in:
self.slice_layers = torch.nn.ModuleDict()

error message:
CUDA out of memory. Tried to allocate 5.00 GiB (GPU 0; 21.99 GiB total capacity; 10.30 GiB already allocated; 2.29 GiB free; 19.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
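
As a sanity check on the numbers in the error message, the sizes can be worked out by hand. The sketch below assumes float32 (4 bytes per element) and, purely for illustration, that the output length stays at 256; the channel count it derives is a back-of-the-envelope figure, not something read from the model.

```python
GIB = 2**30

def tensor_gib(shape, bytes_per_element=4):
    """Size in GiB of a dense tensor with the given shape (float32 by default)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element / GIB

# The initial input (B, C, N) = (2048, 10, 256) in float32 is only 20 MiB.
input_gib = tensor_gib((2048, 10, 256))

# Number of float32 channels at B=2048, N=256 that adds up to the 5 GiB request.
channels_for_5gib = 5 * GIB // (2048 * 256 * 4)

# The same tensor in float16 (as conv outputs would be under autocast) needs half.
half_precision_gib = tensor_gib((2048, channels_for_5gib, 256), bytes_per_element=2)

# Memory held by the caching allocator but not in use, from the error message.
cached_unused_gib = 19.40 - 10.30

print(input_gib * 1024, channels_for_5gib, half_precision_gib, round(cached_unused_gib, 2))
```

About 9.1 GiB is reserved by PyTorch but not allocated while only 2.29 GiB is free, which would be consistent with fragmentation: no single contiguous 5 GiB block is available even though the cached total exceeds it. That might also explain why emptying the cache, or halving the tensor sizes with autocast, makes the error go away, though this is a guess from the numbers rather than a confirmed diagnosis.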

Any idea what to check and how?
Thanks!