A case of exploding RAM while using pinned memory (probably)

ayhyap · March 17, 2022, 8:10pm

Just wanted to document an issue I encountered that I couldn’t find online, and how I resolved it.

tl;dr: if you collect the labels (or any other tensor) from batches while iterating over a dataset with a DataLoader that has pin_memory=True, store a copy made with clone() instead of the original tensor.

In other words…

pred_list = []
label_list = []
for batch in train_loader:
    # training code here; assumed to produce `preds` for this batch
    pred_list.append(preds.detach().cpu())
    # label_list.append(batch['labels'])  # bad: keeps the pinned batch tensor alive
    label_list.append(batch['labels'].clone())  # better: stores an ordinary copy

What I think is happening: the label tensors that come directly from the batch live in pinned (page-locked) memory, which the OS cannot page out to swap. Every tensor kept in the label list therefore stays in physical RAM, so usage keeps growing as the list fills up.
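One way to check this is to compare is_pinned() on the batch tensor and on its clone. Below is a minimal sketch, not my actual training code: ToyDataset and collect_labels are hypothetical names made up for illustration, and pinning is only requested when CUDA is available. As far as I can tell, clone() allocates ordinary pageable memory, so the second flag should be False even when the first is True.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in dataset returning dict samples with a 'labels' key."""

    def __init__(self, n=64):
        self.features = torch.randn(n, 8)
        self.labels = torch.arange(n) % 2

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {"features": self.features[idx], "labels": self.labels[idx]}


def collect_labels(loader):
    """Gather labels across batches, storing a clone of each batch tensor."""
    label_list = []
    pinned_flags = []
    for batch in loader:
        labels = batch["labels"]
        copy = labels.clone()
        # Record (original is pinned, clone is pinned) for inspection.
        pinned_flags.append((labels.is_pinned(), copy.is_pinned()))
        label_list.append(copy)
    return torch.cat(label_list), pinned_flags


# Pinning needs a CUDA-capable setup, so only request it when one is available.
use_pinned = torch.cuda.is_available()
loader = DataLoader(ToyDataset(), batch_size=16, pin_memory=use_pinned)
all_labels, pinned_flags = collect_labels(loader)
print(all_labels.shape, pinned_flags[0])
```

On a machine without CUDA both flags will simply be False, so the comparison is only informative when pin_memory is actually in effect.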

In my case, RAM usage climbed past 20 GB before the fix (and caused thrashing), and stayed at about 5 GB with the fix.