Is emptying the cache a good practice when dealing with huge data from the Dataloader?

Hi!

I wanted to ask whether emptying the cache is good practice, or whether there are better ways to free memory when using a DataLoader with a huge dataset.

With the same model and the same batch size I’m able to train on a smaller dataset, so I’d like to know if there is any way to leverage the bigger dataset.

Thank you!

Which cache exactly would you like to free and how would you do it?
Huge datasets are usually loaded lazily so I’m unsure which cache you are using.

Thank you for your reply!

I am talking about the CUDA cache, because I run out of memory when using the bigger dataset, and it doesn’t happen with the smaller one.

Are huge datasets loaded lazily by default? Otherwise I’m not sure I have enabled such behavior. For the moment I have the classic Dataset/DataLoader configuration: in the training loop I do all the loading and processing on the CPU, then I send the batch to the GPU using .to(device). The batches themselves are small, tensors of size (batch_size, 700), but the Dataset itself is big (huge length).

Maybe I’m doing it wrong and should proceed differently in this case?

Unsure if I understand this claim correctly, but are you running out of GPU memory while loading more samples from your dataset into the host RAM, or are you trying to load the data directly to the GPU? In the latter case, why don’t you move the batches to the device inside the DataLoader loop, as is the common approach?

No, the data loading logic is defined in your Dataset, so you would need to check it and see how exactly the data is loaded and processed.

There is no “classic Dataset”, so you are either implementing a custom torch.utils.data.Dataset or using a built-in class such as ImageFolder.

This is the common approach, so I’m unsure why the GPU memory usage should change with the size of the dataset.

To your original question: no, clearing the CUDA cache won’t help as the GPU memory usage is unrelated to the size of the dataset (it’s rather related to the batch size, the model, etc.).
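
For illustration, a minimal sketch of what emptying the cache actually does: torch.cuda.empty_cache() only returns cached-but-unused blocks to the driver, while memory backing live tensors stays allocated, so it cannot prevent a genuine out-of-memory error.

import torch

x = torch.randn(1024, 1024, device="cuda")       # ~4 MB tensor -> "allocated" memory
print(torch.cuda.memory_allocated() / 1024**2)   # ~4.0
del x                                            # the freed memory goes back to PyTorch's cache
print(torch.cuda.memory_allocated() / 1024**2)   # 0.0
print(torch.cuda.memory_reserved() / 1024**2)    # still > 0: kept in the cache for reuse
torch.cuda.empty_cache()                         # returns the cached blocks to the driver
print(torch.cuda.memory_reserved() / 1024**2)    # 0.0
# Memory held by live tensors (parameters, activations, the current batch) is not in the
# cache, so empty_cache() cannot release it.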


It is the first case; my training loop is basically this:

for batch in data_loader:
   examples, labels = batch
   examples = examples.to(device)
   examples = torch.squeeze(examples)
   labels = labels.to(device)

Thank you, I’ll check how to change that then, but since it seems that the issue doesn’t come from here I’ll focus on other areas first.

Yeah you’re right, my bad sorry. I have a custom Dataset, and it is defined like this:

import torch
from torch.utils.data import Dataset

class SentenceCLS(Dataset):
  def __init__(self, data, labels, tokenizer):
    self.data = data
    self.labels = labels
    self.tokenizer = tokenizer

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    if isinstance(idx, int):
      idx = [idx]
    if torch.is_tensor(idx):
      idx = idx.tolist()

    examples = [self.data[i] for i in idx]
    labels = [self.labels[i] for i in idx]

    examples = self.tokenizer(examples, padding=True, return_tensors="pt")['input_ids']
    return examples, labels

I also use a Sampler, but I don’t know if it’s relevant as it only outputs lists of indices. I can post its code here too, but I didn’t want to overload the post.

Won’t PyTorch cache in memory all the samples that are sent to the GPU for easier allocation later on? And if the dataset is bigger, won’t it cache more samples?

Could it also be related to intermediate tensors in the model? The model is too big to post here; it is composed of many components, and some of them contain residual connections, so there are many forward passes of this type:

def forward(self, x):
   # x goes through a bunch of layers
   x = some_computations(x)

   # we use another variable for next layers while storing x for the residual connection
   y = some_computations2(x)

   # residual layer
   x = layer_norm_concat(x, y)

And it goes on. So I was wondering if these intermediate y tensors could be the issue.

The loop looks fine and matches the common approach.

Also the custom Dataset looks alright assuming the self.tokenizer processes samples on the CPU.

That also sounds correct, as I again assume the indices are purely on the CPU.

PyTorch will push freed device memory to its cache so it can reuse it later, which avoids the expensive cudaMalloc/cudaFree calls. The host memory is irrelevant for this, and in your example only the batch size would matter, since only a single batch is moved to the device. Once the input data is no longer needed, its memory will be moved to the cache and reused for the next allocation.
The size of the dataset won’t matter as you are just iterating more batches. The GPU memory usage would only be defined by the single batch, the model, intermediate activations, the optimizer etc., but is unrelated to the number of samples in the dataset.
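
A quick synthetic way to convince yourself (a minimal sketch, no model involved; the shapes below are made up): the allocated memory returns to ~0 after every iteration, no matter how many batches are iterated.

import torch

for step in range(1000):  # pretend this is a much longer dataset
    # one batch at a time; the previous batch's memory is already back in the cache
    batch = torch.randint(0, 30000, (1024, 700), device="cuda")
    # ... forward/backward would go here ...
    del batch

print(torch.cuda.memory_allocated() / 1024**2)  # ~0 MB, regardless of the iteration count
print(torch.cuda.memory_reserved() / 1024**2)   # bounded by a single batch (plus allocator rounding)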

Intermediates are stored if they are needed for the gradient computation. However, the size of these intermediate tensors (and thus also their GPU memory usage) is defined by the batch size and the model (or rather the actual operation performed in the layer).
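
A small sketch to make this measurable (the two-layer model below is made up; 700 just mirrors the feature size mentioned earlier): the peak memory of a forward/backward pass grows with the batch size, while the dataset size never enters the picture.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(700, 2048), nn.ReLU(), nn.Linear(2048, 10)).cuda()

for batch_size in (64, 256, 1024):
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, 700, device="cuda")
    out = model(x)              # intermediate activations are kept for the backward pass
    out.sum().backward()
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"batch_size={batch_size}: peak allocated {peak:.1f} MB")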


Thank you for your answers. Do you think the issue may come from batches holding tensors of different sizes? For the same batch size, some batches have the shape (batch_size, max_sequence_length1), others (batch_size, max_sequence_length2).

It’s because the variance of the sequence-length distribution in the huge dataset is so large that I have decided to bin the sequences into length categories so that the padding is more consistent. That’s why I’m using a Sampler.
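
For illustration, a hypothetical sketch of such a sampler (not my actual code; the name LengthBucketSampler and bucket_width are made up). It yields lists of indices whose sequences fall into the same length bucket, so each batch only pads up to the longest sequence of its bucket:

import random
from torch.utils.data import Sampler

class LengthBucketSampler(Sampler):
    def __init__(self, lengths, batch_size, bucket_width=50):
        self.batch_size = batch_size
        buckets = {}
        for idx, length in enumerate(lengths):
            buckets.setdefault(length // bucket_width, []).append(idx)
        self.buckets = list(buckets.values())

    def __iter__(self):
        batches = []
        for bucket in self.buckets:
            random.shuffle(bucket)
            for i in range(0, len(bucket), self.batch_size):
                batches.append(bucket[i:i + self.batch_size])
        random.shuffle(batches)
        return iter(batches)

    def __len__(self):
        return sum((len(b) + self.batch_size - 1) // self.batch_size for b in self.buckets)

Since the __getitem__ above already accepts a list of indices, such a sampler could be plugged in via DataLoader(dataset, sampler=LengthBucketSampler(lengths, batch_size), batch_size=None).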

I understood from your reply that the GPU memory usage would only be defined by a single batch, but during loading, when it meets batches of different shapes, is it safe to assume that PyTorch will cache the first batch of each new size? If that is the case, I think this is what explodes my GPU RAM, plus the intermediate activations, which will now also have different sizes depending on the batch (after the embedding layer I’m working with 3D tensors but only operating on the last dimension, the embedding dimension, so the middle dimension corresponds to the sequence length, and since this changes, the intermediate activations’ sizes change too).

Yes, your explanation makes sense and a variable input shape will also cause different memory requirements. A large sequence length in one batch could increase the memory usage significantly and could yield an out of memory error. You could try to check the min/max sequence lengths and maybe clip them to a shape which fits your setup and especially the GPU memory.
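
For example, a sketch assuming a Hugging Face style tokenizer (as the Dataset above suggests; the model name and max_length=512 below are placeholders): truncation=True with max_length caps every batch at a known worst-case width.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model name
examples = ["a short sentence",
            "a much longer sentence that would otherwise blow up the padded width of the whole batch"]

input_ids = tokenizer(
    examples,
    padding=True,
    truncation=True,       # drop tokens beyond max_length
    max_length=512,        # hypothetical cap chosen to fit the GPU memory
    return_tensors="pt",
)["input_ids"]
print(input_ids.shape)     # never wider than max_length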

Thank you a lot for your answers. I am able to fit my maximum-length sequences (around 800) even with a high batch size (1028). For now I’ll stick to a small batch size so that I avoid this effect of caching wildly different intermediate activations and batch shapes.

Can I ask you one last question, please: is there any way to learn about the inner workings of the memory management? I couldn’t understand it well just from PyTorch’s documentation.
I have run the following loop on my smaller dataset, with a variable max sequence length, just to iterate fast and see how that affects the allocated and cached memory:

print(torch.cuda.memory_allocated()/1024**2)
print(torch.cuda.memory_cached()/1024**2)
print()

for batch in data_loader:
  examples, labels = batch
  examples = torch.squeeze(examples)

  print(examples.size())

  examples = examples.to(device)

  print(torch.cuda.memory_allocated()/1024**2)
  print(torch.cuda.memory_cached()/1024**2)
  print("-----")

And I got the following:

0.0
0.0
/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(

torch.Size([4092, 21])
0.65576171875
2.0
-----
torch.Size([4092, 18])
1.2177734375
2.0
-----
torch.Size([4092, 19])
1.1552734375
2.0
-----
torch.Size([4092, 21])
1.2490234375
2.0
-----
torch.Size([4092, 23])
1.3740234375
2.0
-----
torch.Size([4092, 30])
1.6552734375
2.0
-----
torch.Size([4092, 35])
2.02978515625
22.0
-----
torch.Size([357, 48])
1.06787109375
22.0
-----

I would love to understand why the allocated memory decreased going from torch.Size([4092, 35]) to torch.Size([357, 48]), and I’d love to be able to compute by myself when the cached memory will increase (e.g., why at torch.Size([4092, 35]) and not at torch.Size([4092, 30])). (All batches have the same data type, torch.int64.)

Thank you a lot again, it feels satisfying to be able to pinpoint the issue ^^

The batch size decreased by a large amount while the sequence length (or feature dimension) increased by a bit. Depending on the model architecture the decrease in memory would be expected, and you would need to check which layers are used and how their memory usage is defined by a) the batch size and b) the input feature dimension.
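
A rough back-of-the-envelope check on the input tensors alone (int64 = 8 bytes per element) already shows the drop:

print(4092 * 35 * 8 / 1024**2)  # ~1.09 MB for the [4092, 35] batch
print(357 * 48 * 8 / 1024**2)   # ~0.13 MB for the [357, 48] batch

The much smaller batch dimension outweighs the slightly longer sequences, so less memory is allocated.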

Some caching settings can be adjusted via PYTORCH_CUDA_ALLOC_CONF as described in the docs. Based on your use case the second allocation needs >=2MB and is thus allocating from the large pool as seen here:

print(torch.cuda.memory_allocated()/1024**2)
# 0.0
print(torch.cuda.memory_reserved()/1024**2)
# 0.0

# 1MB
x = torch.randn(1024*1024//4 * 1, device="cuda")
print(torch.cuda.memory_summary())
# ...
# |---------------------------------------------------------------------------|
# | Allocations           |       1    |       1    |       1    |       0    |
# |       from large pool |       0    |       0    |       0    |       0    |
# |       from small pool |       1    |       1    |       1    |       0    |
# |---------------------------------------------------------------------------|
# ...
print(torch.cuda.memory_allocated()/1024**2)
# 1.0
print(torch.cuda.memory_reserved()/1024**2)
# 2.0
del x

# 2MB
x = torch.randn(1024*1024//4 * 2, device="cuda")
print(torch.cuda.memory_summary())
# ...
# |---------------------------------------------------------------------------|
# | Allocations           |       1    |       1    |       2    |       1    |
# |       from large pool |       1    |       1    |       1    |       0    |
# |       from small pool |       0    |       1    |       1    |       1    |
# |---------------------------------------------------------------------------|
# ...
print(torch.cuda.memory_allocated()/1024**2)
# 2.0
print(torch.cuda.memory_reserved()/1024**2)
# 22.0

I see. Thank you for your answers and your help. I learned a lot from them, and they clarified my understanding of the PyTorch/CUDA interaction.