CUDA memory leak?

I just started training a neural network on a new dataset, too large to keep in memory. Training goes well for a few hours, but eventually it runs out of CUDA memory, and I have been trying to figure out why.

The dataset consists of proteins, where each sample can vary quite dramatically in size, so I figured the largest samples might simply be too large for my GPU to train on. So I found the largest sample in the dataset and locked my network to train only on that sample. Interestingly enough, this particular sample does take up about 23196 / 24576 MB on my GPU when it hits backprop the first time around, so it is relatively big, but there is still about 1.5 GB free.
However, repeatedly training on this one sample doesn't make it crash within the first 100 iterations.
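For reference, the "lock to the largest sample" step can be done by on-disk file size alone. Here is a minimal stand-alone sketch of that idea, where dummy files stand in for the real .pt samples (all filenames here are made up for the demo):

```python
import os
import tempfile

def largest_sample(paths):
    """Return the path whose file is biggest on disk."""
    return max(paths, key=os.path.getsize)

# Demo: dummy files standing in for per-sample .pt files.
tmp = tempfile.mkdtemp()
sizes = {'small.pt': 10, 'medium.pt': 100, 'big.pt': 1000}
paths = []
for name, n_bytes in sizes.items():
    p = os.path.join(tmp, name)
    with open(p, 'wb') as f:
        f.write(b'\0' * n_bytes)
    paths.append(p)

print(os.path.basename(largest_sample(paths)))  # -> big.pt
```

File size is only a proxy for GPU memory use, but it correlates well when every sample stores the same fields.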

Each data sample is saved as an individual .pt file and loaded when needed as shown below (note that right now the dataset is locked to just one sample, hence the use of `self.idx0`/`self.idx1` to select which file to load):


    def __getitem__(self, idx):
        protein = self.filenames[self.idx0][self.idx1]
        data_dict = torch.load(protein)
        seq = data_dict['seq']
        chain = data_dict['chain']
        rN = data_dict['rN']
        rC = data_dict['rC']
        rCA = data_dict['rCA']
        rCB = data_dict['rCB']
        bN = data_dict['bN']
        bC = data_dict['bC']
        bCA = data_dict['bCA']
        bCB = data_dict['bCB']
        E = data_dict['E']
        return seq, chain, rN, rC, rCA, rCB, bN, bC, bCA, bCB, E

The dataset is put into a standard PyTorch DataLoader with `num_workers=0` (I know a non-zero worker count could otherwise lead to memory problems).
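The per-sample, load-on-access pattern above can be sketched without PyTorch at all; here `pickle` stands in for `torch.load`, and the class and file names are hypothetical, just to show the shape of the lazy loading:

```python
import os
import pickle
import tempfile

class LazyProteinDataset:
    """Minimal stand-in for the Dataset above: one file per sample,
    read from disk only when __getitem__ is called."""
    def __init__(self, filenames):
        self.filenames = filenames

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        # torch.load would be used for real .pt files; pickle is the
        # stand-in here so this sketch runs without PyTorch.
        with open(self.filenames[idx], 'rb') as f:
            return pickle.load(f)

# Demo: write two dummy samples, then read one back lazily.
tmp = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmp, f'sample{i}.pkl')
    with open(p, 'wb') as f:
        pickle.dump({'seq': [i, i + 1], 'E': float(i)}, f)
    paths.append(p)

ds = LazyProteinDataset(paths)
print(ds[1]['E'])  # -> 1.0
```

Nothing is cached between calls, so host memory stays flat regardless of dataset size; the cost is one disk read per sample per epoch.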

My training routine looks like this (which runs one epoch on the data):

def run_model(net,optimizer,dataloader,device,loss_fnc,c,epoch,train_network=True):
    """
    A simple model runner routine, designed for either training or testing a model.
    """
    torch.set_grad_enabled(train_network)
    if train_network:
        net.train()
    else:
        net.eval()
    alossE = 0
    for i, (seq, chain, rN, rC, rCA, rCB, bN, bC, bCA, bCB, E ) in enumerate(dataloader):
        coords = torch.cat((rCA,rC,rN,rCB),dim=2).permute(0,2,1).to(device=device, non_blocking=True)
        E = E.to(device=device, dtype=torch.float32, non_blocking=True)
        Z = net(coords)
        n = coords.shape[-1]
        I = torch.arange(n).repeat(n)
        J = torch.repeat_interleave(torch.arange(n),n)
        Epred = Z[0,seq[0,I],seq[0,J],I,J].reshape(1,n,n)

        loss_E_abs = loss_fnc(E, Epred)
        loss_E_ref = loss_fnc(E, 0*E)
        loss_E = loss_E_abs / loss_E_ref

        loss = loss_E

        if train_network:
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_value_(net.parameters(), 0.1)
            optimizer.step()
        alossE += loss_E.detach().item()
    return alossE / (i + 1)
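As an aside, the pairwise index construction in the loop above (`I`, `J`) can be illustrated in plain Python: `torch.arange(n).repeat(n)` tiles the range, while `torch.repeat_interleave(torch.arange(n), n)` repeats each element `n` times, so together `(I[k], J[k])` enumerates every ordered pair of positions:

```python
# Pure-Python equivalent of the index construction in run_model:
#   I = torch.arange(n).repeat(n)                    -> tiled range
#   J = torch.repeat_interleave(torch.arange(n), n)  -> each element repeated
n = 3
I = [i for _ in range(n) for i in range(n)]  # [0,1,2, 0,1,2, 0,1,2]
J = [j for j in range(n) for _ in range(n)]  # [0,0,0, 1,1,1, 2,2,2]

pairs = list(zip(I, J))
print(pairs[:4])  # -> [(0, 0), (1, 0), (2, 0), (0, 1)]
```

This is why `Epred` ends up with shape `(1, n, n)`: the `n * n` gathered values are exactly one prediction per ordered residue pair, reshaped into a matrix.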

Here is the error message I get in one of the crashes:


Traceback (most recent call last):
  File "/snap/pycharm-community/274/plugins/python-ce/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-community/274/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/tue/PycharmProjects/PairwiseEnergy/EnergyPrediction/Example/train_energy_predictor.py", line 43, in <module>
    main(c)
  File "/home/tue/PycharmProjects/PairwiseEnergy/EnergyPrediction/MachineLearning/main.py", line 63, in main
    loss_E_train, loss_F_train = run_model(net,optimizer,dataset_train,c['device'],loss_fnc,c,epoch,train_network=True,viz=c['viz'])
  File "/home/tue/PycharmProjects/PairwiseEnergy/EnergyPrediction/MachineLearning/Optimizer.py", line 48, in run_model
    loss.backward()
  File "/home/tue/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/tue/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 1.46 GiB (GPU 0; 23.70 GiB total capacity; 16.80 GiB already allocated; 1.40 GiB free; 19.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This time it crashed after about 5000 iterations on the full dataset; before that it took 24000 iterations before crashing. In both cases it crashes on one of the really large samples, which makes sense, and in both cases it crashes during the first epoch, so it never leaves this training routine (which would otherwise take it into the same routine to run evaluation on the validation set).

I run this on a single RTX 3090 Ti. It should also be noted that I have previously run this exact training script and neural network successfully on a smaller dataset that could be kept in memory, so I suspect the issue is somehow related to the fact that I repeatedly load data files from disk.

I note that the error message states that 19.58 GiB is reserved by PyTorch, yet only 16.80 GiB is allocated; I'm not sure what this difference means exactly.

Edit:
I still had the session open in the debugger at the time of the crash, so I ran all the diagnostic commands I could think of, but they don't really seem to tell me much new.


torch.cuda.memory_allocated()
10216355328
torch.cuda.memory_allocated()/1024/1024
9743.07568359375
torch.cuda.max_memory_allocated()/1024/1024
18622.62744140625
torch.cuda.memory_reserved()/1024/1024
20054.0
torch.cuda.max_memory_reserved()/1024/1024
21364.0
torch.cuda.memory_stats()
OrderedDict([('active.all.allocated', 4403903), ('active.all.current', 165), ('active.all.freed', 4403738), ('active.all.peak', 205), ('active.large_pool.allocated', 2078248), ('active.large_pool.current', 94), ('active.large_pool.freed', 2078154), ('active.large_pool.peak', 103), ('active.small_pool.allocated', 2325655), ('active.small_pool.current', 71), ('active.small_pool.freed', 2325584), ('active.small_pool.peak', 170), ('active_bytes.all.allocated', 27959210139136), ('active_bytes.all.current', 10216355328), ('active_bytes.all.freed', 27948993783808), ('active_bytes.all.peak', 19527240192), ('active_bytes.large_pool.allocated', 27518735920128), ('active_bytes.large_pool.current', 10216019968), ('active_bytes.large_pool.freed', 27508519900160), ('active_bytes.large_pool.peak', 19526904832), ('active_bytes.small_pool.allocated', 440474219008), ('active_bytes.small_pool.current', 335360), ('active_bytes.small_pool.freed', 440473883648), ('active_bytes.small_pool.peak', 91639808), ('allocated_bytes.all.allocated', 27959210139136), ('allocated_bytes.all.current', 10216355328), ('allocated_bytes.all.freed', 27948993783808), ('allocated_bytes.all.peak', 19527240192), ('allocated_bytes.large_pool.allocated', 27518735920128), ('allocated_bytes.large_pool.current', 10216019968), ('allocated_bytes.large_pool.freed', 27508519900160), ('allocated_bytes.large_pool.peak', 19526904832), ('allocated_bytes.small_pool.allocated', 440474219008), ('allocated_bytes.small_pool.current', 335360), ('allocated_bytes.small_pool.freed', 440473883648), ('allocated_bytes.small_pool.peak', 91639808), ('allocation.all.allocated', 4403903), ('allocation.all.current', 165), ('allocation.all.freed', 4403738), ('allocation.all.peak', 205), ('allocation.large_pool.allocated', 2078248), ('allocation.large_pool.current', 94), ('allocation.large_pool.freed', 2078154), ('allocation.large_pool.peak', 103), ('allocation.small_pool.allocated', 2325655), ('allocation.small_pool.current', 71), 
('allocation.small_pool.freed', 2325584), ('allocation.small_pool.peak', 170), ('inactive_split.all.allocated', 1943618), ('inactive_split.all.current', 32), ('inactive_split.all.freed', 1943586), ('inactive_split.all.peak', 79), ('inactive_split.large_pool.allocated', 1005096), ('inactive_split.large_pool.current', 28), ('inactive_split.large_pool.freed', 1005068), ('inactive_split.large_pool.peak', 73), ('inactive_split.small_pool.allocated', 938522), ('inactive_split.small_pool.current', 4), ('inactive_split.small_pool.freed', 938518), ('inactive_split.small_pool.peak', 56), ('inactive_split_bytes.all.allocated', 32105227474944), ('inactive_split_bytes.all.current', 2978925056), ('inactive_split_bytes.all.freed', 32102248549888), ('inactive_split_bytes.all.peak', 6537211904), ('inactive_split_bytes.large_pool.allocated', 31606285302272), ('inactive_split_bytes.large_pool.current', 2975066112), ('inactive_split_bytes.large_pool.freed', 31603310236160), ('inactive_split_bytes.large_pool.peak', 6535435776), ('inactive_split_bytes.small_pool.allocated', 498942172672), ('inactive_split_bytes.small_pool.current', 3858944), ('inactive_split_bytes.small_pool.freed', 498938313728), ('inactive_split_bytes.small_pool.peak', 33597440), ('max_split_size', -1), ('num_alloc_retries', 7), ('num_ooms', 1), ('oversize_allocations.allocated', 0), ('oversize_allocations.current', 0), ('oversize_allocations.freed', 0), ('oversize_allocations.peak', 0), ('oversize_segments.allocated', 0), ('oversize_segments.current', 0), ('oversize_segments.freed', 0), ('oversize_segments.peak', 0), ('reserved_bytes.all.allocated', 80834723840), ('reserved_bytes.all.current', 21028143104), ('reserved_bytes.all.freed', 59806580736), ('reserved_bytes.all.peak', 22401777664), ('reserved_bytes.large_pool.allocated', 80274784256), ('reserved_bytes.large_pool.current', 21023948800), ('reserved_bytes.large_pool.freed', 59250835456), ('reserved_bytes.large_pool.peak', 22305308672), 
('reserved_bytes.small_pool.allocated', 559939584), ('reserved_bytes.small_pool.current', 4194304), ('reserved_bytes.small_pool.freed', 555745280), ('reserved_bytes.small_pool.peak', 96468992), ('segment.all.allocated', 417), ('segment.all.current', 30), ('segment.all.freed', 387), ('segment.all.peak', 154), ('segment.large_pool.allocated', 150), ('segment.large_pool.current', 28), ('segment.large_pool.freed', 122), ('segment.large_pool.peak', 108), ('segment.small_pool.allocated', 267), ('segment.small_pool.current', 2), ('segment.small_pool.freed', 265), ('segment.small_pool.peak', 46)])
torch.cuda.memory_snapshot()
[{'device': 0, 'address': 139748974788608, 'total_size': 1560281088, 'allocated_size': 82162176, 'active_size': 82162176, 'segment_type': 'large', 'blocks': [{'size': 82162176, 'state': 'active_allocated'}, {'size': 1478118912, 'state': 'inactive'}]}, {'device': 0, 'address': 139753454305280, 'total_size': 2097152, 'allocated_size': 56832, 'active_size': 56832, 'segment_type': 'small', 'blocks': [{'size': 1935360, 'state': 'inactive'}, {'size': 47616, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 104960, 'state': 'inactive'}]}, {'device': 0, 'address': 139754779705344, 'total_size': 1566572544, 'allocated_size': 0, 'active_size': 0, 'segment_type': 'large', 'blocks': [{'size': 1566572544, 'state': 'inactive'}]}, {'device': 0, 'address': 139756356763648, 'total_size': 1566572544, 'allocated_size': 0, 'active_size': 0, 'segment_type': 'large', 'blocks': [{'size': 1566572544, 'state': 'inactive'}]}, {'device': 0, 'address': 139757933821952, 'total_size': 1566572544, 'allocated_size': 0, 'active_size': 0, 'segment_type': 'large', 'blocks': [{'size': 1566572544, 'state': 'inactive'}]}, {'device': 0, 'address': 139759510880256, 'total_size': 1566572544, 'allocated_size': 1564994048, 'active_size': 
1564994048, 'segment_type': 'large', 'blocks': [{'size': 1564994048, 'state': 'active_allocated'}, {'size': 1578496, 'state': 'inactive'}]}, {'device': 0, 'address': 139761087938560, 'total_size': 1566572544, 'allocated_size': 0, 'active_size': 0, 'segment_type': 'large', 'blocks': [{'size': 1566572544, 'state': 'inactive'}]}, {'device': 0, 'address': 139762664996864, 'total_size': 1566572544, 'allocated_size': 1564994048, 'active_size': 1564994048, 'segment_type': 'large', 'blocks': [{'size': 1564994048, 'state': 'active_allocated'}, {'size': 1578496, 'state': 'inactive'}]}, {'device': 0, 'address': 139764242055168, 'total_size': 1566572544, 'allocated_size': 0, 'active_size': 0, 'segment_type': 'large', 'blocks': [{'size': 1566572544, 'state': 'inactive'}]}, {'device': 0, 'address': 139765819113472, 'total_size': 1260388352, 'allocated_size': 1234429440, 'active_size': 1234429440, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 25958912, 'state': 'inactive'}]}, {'device': 0, 'address': 139767094181888, 'total_size': 1228931072, 'allocated_size': 1151934464, 'active_size': 1151934464, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 
'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 76996608, 'state': 'inactive'}]}, {'device': 0, 'address': 139768335695872, 'total_size': 1151336448, 'allocated_size': 1069772288, 'active_size': 1069772288, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 81564160, 'state': 'inactive'}]}, {'device': 0, 'address': 139769510100992, 'total_size': 1059061760, 'allocated_size': 987443712, 'active_size': 987443712, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, 
{'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 71618048, 'state': 'inactive'}]}, {'device': 0, 'address': 139770583842816, 'total_size': 383778816, 'allocated_size': 246819328, 'active_size': 246819328, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 81995776, 'state': 'inactive'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 54963712, 'state': 'inactive'}]}, {'device': 0, 'address': 139770986496000, 'total_size': 383778816, 'allocated_size': 246819328, 'active_size': 246819328, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 81995776, 'state': 'inactive'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 54963712, 'state': 'inactive'}]}, {'device': 0, 'address': 139771389149184, 'total_size': 383778816, 'allocated_size': 82328576, 'active_size': 82328576, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 301450240, 'state': 'inactive'}]}, {'device': 0, 'address': 139771791802368, 'total_size': 383778816, 'allocated_size': 246985728, 'active_size': 246985728, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'inactive'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 54630912, 'state': 'inactive'}]}, {'device': 0, 'address': 139772194455552, 'total_size': 383778816, 'allocated_size': 329147904, 'active_size': 329147904, 'segment_type': 'large', 'blocks': [{'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 
'active_allocated'}, {'size': 54630912, 'state': 'inactive'}]}, {'device': 0, 'address': 139772932653056, 'total_size': 864026624, 'allocated_size': 740457984, 'active_size': 740457984, 'segment_type': 'large', 'blocks': [{'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'inactive'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 82162176, 'state': 'active_allocated'}, {'size': 82328576, 'state': 'active_allocated'}, {'size': 41406464, 'state': 'inactive'}]}, {'device': 0, 'address': 139775348572160, 'total_size': 121634816, 'allocated_size': 86241280, 'active_size': 86241280, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 3912704, 'state': 'active_allocated'}, {'size': 35393536, 'state': 'inactive'}]}, {'device': 0, 'address': 139775482789888, 'total_size': 121634816, 'allocated_size': 62600192, 'active_size': 62600192, 'segment_type': 'large', 'blocks': [{'size': 62600192, 'state': 'active_allocated'}, {'size': 59034624, 'state': 'inactive'}]}, {'device': 0, 'address': 139775617007616, 'total_size': 121634816, 'allocated_size': 82328576, 'active_size': 82328576, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 39306240, 'state': 'inactive'}]}, {'device': 0, 'address': 139776355205120, 'total_size': 121634816, 'allocated_size': 82162176, 'active_size': 82162176, 'segment_type': 'large', 'blocks': [{'size': 82162176, 'state': 'active_allocated'}, {'size': 39472640, 'state': 'inactive'}]}, {'device': 0, 'address': 139776489422848, 'total_size': 121634816, 'allocated_size': 82328576, 'active_size': 82328576, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, 
{'size': 39306240, 'state': 'inactive'}]}, {'device': 0, 'address': 139776623640576, 'total_size': 121634816, 'allocated_size': 82328576, 'active_size': 82328576, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 39306240, 'state': 'inactive'}]}, {'device': 0, 'address': 139776757858304, 'total_size': 121634816, 'allocated_size': 82328576, 'active_size': 82328576, 'segment_type': 'large', 'blocks': [{'size': 82328576, 'state': 'active_allocated'}, {'size': 39306240, 'state': 'inactive'}]}, {'device': 0, 'address': 139776892076032, 'total_size': 121634816, 'allocated_size': 82162176, 'active_size': 82162176, 'segment_type': 'large', 'blocks': [{'size': 82162176, 'state': 'active_allocated'}, {'size': 39472640, 'state': 'inactive'}]}, {'device': 0, 'address': 139777772879872, 'total_size': 20971520, 'allocated_size': 16003584, 'active_size': 16003584, 'segment_type': 'large', 'blocks': [{'size': 5334528, 'state': 'active_allocated'}, {'size': 5334528, 'state': 'active_allocated'}, {'size': 1944576, 'state': 'inactive'}, {'size': 5334528, 'state': 'active_allocated'}, {'size': 3023360, 'state': 'inactive'}]}, {'device': 0, 'address': 139780438360064, 'total_size': 20971520, 'allocated_size': 9247232, 'active_size': 9247232, 'segment_type': 'large', 'blocks': [{'size': 5334528, 'state': 'active_allocated'}, {'size': 3912704, 'state': 'active_allocated'}, {'size': 11724288, 'state': 'inactive'}]}, {'device': 0, 'address': 139780459331584, 'total_size': 2097152, 'allocated_size': 278528, 'active_size': 278528, 'segment_type': 'small', 'blocks': [{'size': 2048, 'state': 'active_allocated'}, {'size': 2048, 'state': 'active_allocated'}, {'size': 33792, 'state': 'active_allocated'}, {'size': 1536, 'state': 'active_allocated'}, {'size': 25600, 'state': 'active_allocated'}, {'size': 313856, 'state': 'inactive'}, {'size': 25600, 'state': 'active_allocated'}, {'size': 1504768, 'state': 'inactive'}, {'size': 25600, 'state': 
'active_allocated'}, {'size': 33792, 'state': 'active_allocated'}, {'size': 2048, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 2048, 'state': 'active_allocated'}, {'size': 2048, 'state': 'active_allocated'}, {'size': 2048, 'state': 'active_allocated'}, {'size': 1536, 'state': 'active_allocated'}, {'size': 1536, 'state': 'active_allocated'}, {'size': 2048, 'state': 'active_allocated'}, {'size': 2048, 'state': 'active_allocated'}, {'size': 33792, 'state': 'active_allocated'}, {'size': 33792, 'state': 'active_allocated'}, {'size': 1536, 'state': 'active_allocated'}, {'size': 25600, 'state': 'active_allocated'}, {'size': 1536, 'state': 'active_allocated'}, {'size': 1536, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 'active_allocated'}, {'size': 512, 'state': 
'active_allocated'}]}]

Hi,
can you please try changing `alossE += loss_E.detach().item()` to `alossE += loss_E.item()` or `alossE += loss_E.detach().to("cpu").numpy()`?


That is a very interesting change.
If I understand correctly, you are suggesting that by calling detach() followed by .item() I actually leave the value in GPU memory and it somehow doesn't get freed?

I will test it out over the weekend and see whether that fixes the problem.


I think so, yes. Let me know when you try it.

After the change you suggested, the code has been running without problems for several days, so thank you very much!
