I have a model that needs to compute `log_softmax` on a tensor of shape `(batch_size, x, y, z)` during inference, and then `gather` a tensor of shape `(batch_size, x, y, 1)` from it. In most cases `x` is smaller than 2000 and it works fine, but when the model encounters an example with `x` around 3000, it reports CUDA out of memory during the computation of `log_softmax`, because `(batch_size, x, y, z)` is indeed a large tensor. So I tried to split `x` into several shards, each with a size no larger than 1000, first apply `log_softmax` and `gather` independently to each shard, and finally `cat` the results from the different shards. However, it still reports CUDA out of memory during the computation of `log_softmax`. This is unexpected to me, because the computations for the different shards are totally independent; in other words, when computing the next shard, the previous `log_softmax` result can be released entirely. Should I manually release the memory by calling `del` after the computation of each shard?
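Here is a minimal sketch of the sharding approach I described (the function name, shard size, and toy shapes are illustrative), with an explicit `del` on each shard's intermediate:

```python
import torch

def sharded_log_softmax_gather(logits, index, shard_size=1000):
    """Apply log_softmax over the last dim and gather, shard by shard
    along dim 1, dropping each shard's intermediate eagerly."""
    outputs = []
    for start in range(0, logits.size(1), shard_size):
        shard = logits[:, start:start + shard_size]        # (B, s, y, z)
        log_probs = torch.log_softmax(shard, dim=-1)       # large intermediate
        picked = torch.gather(
            log_probs, -1, index[:, start:start + shard_size]
        )                                                  # (B, s, y, 1)
        outputs.append(picked)
        del log_probs  # drop the reference so the allocator can reuse the block
    return torch.cat(outputs, dim=1)

# Toy sizes; in the real use case x can be ~3000.
logits = torch.randn(2, 10, 4, 8)
index = torch.randint(0, 8, (2, 10, 4, 1))
out = sharded_log_softmax_gather(logits, index, shard_size=4)
print(out.shape)  # torch.Size([2, 10, 4, 1])
```

Since the softmax is taken along the last dimension, slicing along dim 1 does not change the result, so the sharded output matches the unsharded one exactly.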
Also, I am wondering why several computations of size `(batch_size, 1000, y, z)` within one forward pass result in OOM, while many computations of size `(batch_size, 2000, y, z)` across different forward passes never lead to OOM, considering that in both scenarios the computation doesn't really depend on the previous `log_softmax` results. Does PyTorch have any memory-releasing mechanism at the end of every `forward` pass?
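To make "can be totally released" concrete: as far as I understand, PyTorch returns a tensor's memory to its caching allocator as soon as the last Python reference to it goes away, which is also why locals created inside `forward` die when the call returns. A quick CPU check with `weakref`:

```python
import weakref
import torch

def forward_like():
    t = torch.randn(1000)  # local tensor, alive only inside this call
    return weakref.ref(t)  # a weak reference does not keep t alive

ref = forward_like()
print(ref() is None)  # True: the tensor was freed when forward_like returned
```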
Did you wrap the code in a `with torch.no_grad()` block, since you are dealing with an inference use case? This will make sure the intermediate tensors, which would be needed to calculate the gradients during the backward call, are not stored.

If you've already done that, could you check how close you are to the memory limit on your device using `dim1=2000`? If you are close to an OOM, creating multiple instances of smaller tensors could still yield the OOM, e.g. due to memory fragmentation.
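A sketch of both suggestions (the model and tensor shapes below are placeholders; the `torch.cuda` memory queries are the standard ones):

```python
import torch

@torch.no_grad()  # inference: don't keep intermediates for a backward pass
def run_inference(model, batch):
    return model(batch)

# Probe peak memory near the problematic size (only on a CUDA device):
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(4, 2000, 8, 512, device="cuda")  # placeholder shape
    out = torch.log_softmax(x, dim=-1)
    peak = torch.cuda.max_memory_allocated()
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"peak {peak / 1e9:.2f} GB of {total / 1e9:.2f} GB")
```

If the peak at `dim1=2000` is already close to the device total, fragmentation from many smaller allocations can plausibly tip it over at `dim1=3000` even with sharding.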