I am training a Hugging Face LayoutLMv2 model, and I have a custom Dataset class that I iterate over using a DataLoader with a batch/random sampler.
My dataset loads all the relevant info that can stay in memory (labels, paths, etc.), and all the main processing happens in the __getitem__ function for the requested index (loading the image, calling the Hugging Face tokenizer/processor, etc.).
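The structure described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual code: the names LazyFileDataset and build_demo_dataset are hypothetical, the class is plain Python rather than a torch.utils.data.Dataset subclass, and reading raw bytes stands in for the image loading and processor call.

```python
import os
import tempfile


class LazyFileDataset:
    """Map-style dataset: small metadata in memory, heavy work in __getitem__."""

    def __init__(self, paths, labels):
        # Only lightweight metadata is held for the lifetime of the dataset.
        self.paths = list(paths)
        self.labels = list(labels)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # Stand-in for the expensive step (image loading, processor call).
        with open(self.paths[index], "rb") as handle:
            payload = handle.read()
        return {"payload": payload, "label": self.labels[index]}


def build_demo_dataset(num_items=4):
    """Create a few small temporary files so the sketch runs end to end."""
    directory = tempfile.mkdtemp()
    paths = []
    for i in range(num_items):
        path = os.path.join(directory, f"item_{i}.bin")
        with open(path, "wb") as handle:
            handle.write(bytes([i]) * 8)
        paths.append(path)
    return LazyFileDataset(paths, labels=list(range(num_items)))
```

Nothing loaded in __getitem__ is kept on the instance here, so in this simplified form memory should not grow with the number of items visited.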
If my dataset is small (a few thousand data points), things work normally and I see almost no change in RAM used. For large data (30-40 thousand files), loading the initial data works fine; the only difference is that more RAM is consumed (more data points, more RAM). But while iterating over the dataloader, the RAM consumption gradually increases.
I have removed the model from the pipeline and the processor from __getitem__ (I basically just loop over the dataloader), and this still happens until the process gets killed because memory fills up. My training runs on GPU, and GPU memory stays normal while the RAM fills up. This happens over a few thousand steps and doesn’t happen with smaller datasets.
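One way to put numbers on the growth during a bare loop is to sample memory every so many steps. The sketch below uses the standard-library tracemalloc module; the function name track_python_memory and its every parameter are made up for illustration. It assumes the growth is in Python-level allocations in the main process: tracemalloc does not see memory held by DataLoader worker processes or by native (non-Python) allocators, so a flat curve here would not rule those out.

```python
import tracemalloc


def track_python_memory(iterable, every=1000):
    """Iterate and record (step, current_bytes) for Python-level allocations."""
    tracemalloc.start()
    samples = []
    try:
        for step, _batch in enumerate(iterable, start=1):
            if step % every == 0:
                # get_traced_memory() returns (current, peak) in bytes.
                current, _peak = tracemalloc.get_traced_memory()
                samples.append((step, current))
    finally:
        tracemalloc.stop()
    return samples
```

Passing the dataloader as iterable and printing the returned samples would show whether the traced size climbs steadily with the step count.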
What could be the reason for this? What could be a way around this?
I will see if I can share a reproducible example.