Unexpected out of memory during inference

I have a model that needs to apply log_softmax to a tensor of shape (batch_size, x, y, z) during inference, and then gather from it to get a tensor of shape (batch_size, x, y, 1). In most cases x is smaller than 2000 and everything works fine, but when the model encounters an example with x around 3000, it reports CUDA out of memory during the log_softmax, since (batch_size, x, y, z) is indeed a large tensor. So I tried splitting x into several shards of at most 1000 each, running log_softmax and gather independently on each shard, and finally concatenating the results from the different shards. However, it still reports CUDA out of memory during the log_softmax. This is unexpected to me, because the computations for the different shards are completely independent: when computing the next shard, the previous log_softmax result could be released entirely. Should I manually release the memory by calling del after the computation of each shard?
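
For reference, here is a simplified sketch of my sharded version (the concrete shapes, chunk size, and tensor names below are just illustrative):

```python
import torch

# Illustrative shapes; in the real model these come from the batch.
batch_size, x, y, z = 2, 3000, 8, 500
logits = torch.randn(batch_size, x, y, z, device="cuda")
index = torch.randint(z, (batch_size, x, y, 1), device="cuda")

chunk_size = 1000
outputs = []
for logits_chunk, index_chunk in zip(
    logits.split(chunk_size, dim=1), index.split(chunk_size, dim=1)
):
    # log_softmax over the last dim, then pick one value per (batch, x, y) position
    log_probs = torch.log_softmax(logits_chunk, dim=-1)
    outputs.append(log_probs.gather(-1, index_chunk))

result = torch.cat(outputs, dim=1)  # shape: (batch_size, x, y, 1)
```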
Also, I am wondering why doing several computations of size (batch_size, 1000, y, z) within one forward pass results in OOM, while doing many computations of size (batch_size, 2000, y, z) across different forward passes never does, given that in both scenarios the computation doesn't depend on the previous log_softmax results. Does PyTorch have any memory-releasing mechanism at the end of every forward pass?

Did you wrap the code in a with torch.no_grad() block, since you are dealing with an inference use case?
This makes sure the intermediate tensors, which would otherwise be kept alive to calculate the gradients during the backward call, are not stored.
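
E.g. something along these lines (model and input are placeholders for your module and data):

```python
import torch

model.eval()  # disable dropout / use running stats in batchnorm
with torch.no_grad():  # no autograd graph -> intermediates can be freed eagerly
    output = model(input)
    log_probs = torch.log_softmax(output, dim=-1)
```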

If you’ve already done that, could you check how close you are to the memory limit on your device when using dim1=2000? If you are already close to an OOM, creating multiple instances of smaller tensors could still trigger it, e.g. due to memory fragmentation.
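
You could print the current usage right after the dim1=2000 run, e.g. via:

```python
import torch

print(torch.cuda.memory_allocated() / 1024**2, "MB allocated")
print(torch.cuda.memory_reserved() / 1024**2, "MB reserved by the caching allocator")
print(torch.cuda.memory_summary())  # detailed per-segment breakdown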