I am using a modified version of the SAC algorithm from spinningup/spinup/algos/pytorch/sac at master · openai/spinningup · GitHub. The memory consumption slowly increases until the process is killed. Following the recommendations in How to debug causes of GPU memory leaks?, I created this function:
import gc
import torch

def gc_n():
    # Count every gc-tracked object that is a tensor, or that carries
    # a tensor in its .data attribute (e.g. nn.Parameter).
    n = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                n += 1
        except Exception:
            pass
    return n
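As a side note, a stricter variant that avoids probing `.data` on arbitrary objects might behave better, since `hasattr` calls `getattr` under the hood and can trigger side effects (lazy properties, proxies) on objects from other libraries. A minimal sketch (the name `gc_tensor_count` is my own):

```python
import gc
import torch

def gc_tensor_count():
    # isinstance() only inspects the type, never touches attributes,
    # so no foreign object's property getters can fire.
    return sum(isinstance(obj, torch.Tensor) for obj in gc.get_objects())
```

This misses objects that merely wrap a tensor in `.data`, but it counts every live `torch.Tensor` (including subclasses such as `nn.Parameter`) without executing foreign code.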
When I run it, I can print n to a file, but afterward the code seems to freeze, with no possibility of killing it via Ctrl+C, and this message:
/home/kubicki/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:126: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
If I use a simpler function:
def gc_simple_n():
    return len(gc.get_objects())
The run proceeds normally, but since every Python object is counted, the result is much harder to interpret.
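To make that raw count easier to interpret, the objects could be bucketed by type name instead of counted in one lump, e.g. with `collections.Counter` (a sketch; `gc_by_type` is a name I made up):

```python
import gc
from collections import Counter

def gc_by_type(top=10):
    # Group all gc-tracked objects by type name and return the most
    # common ones; a steadily growing bucket hints at the leak source.
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(top)
```

Calling this periodically and diffing the results between iterations should show which type's population keeps growing, without ever touching the objects' attributes.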
Could you help me understand the freeze and get a correct count of the tensors?