How to cache DataLoader results

I have a dataset that is not big, but torch.utils.data.DataLoader performance is poor on my system, so I plan to cache the results of the first pass through the DataLoader and then reuse the cached results.

For speed, I don’t want to copy torch.Tensor data while caching. I tried a queue and a list, but it looks like the tensor copy is always there. Could you help with this? How can I cache the results by reference, instead of making a full data copy? Thanks.

The code looks like this:

import torch

val_loader = torch.utils.data.DataLoader(
    val_dataset,
    batch_size=16, shuffle=False,
    num_workers=8, pin_memory=True)

cached = []  # also tried queue.Queue()
for i, (_input, _target) in enumerate(val_loader):
    cached.append((_input, _target))  # performance drops a lot when this line is added
    # do something with (_input, _target)

Are you concerned about data loading speed because the files are on disk rather than in memory? I believe that if there is enough main memory to cache the files, the filesystem will do this by default, so any additional savings from caching the loaded tensors may be minimal. Furthermore, this approach seems to require turning off shuffling (which you have correctly done), which may hurt model accuracy.
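A quick way to check is to time two consecutive passes over the loader with no caching at all; if the second pass is much faster, the OS page cache is already serving the files from memory. A rough sketch (time_epoch is a hypothetical helper, and val_loader is the loader from your post):

import time

def time_epoch(loader):
    # Iterate the loader once, discarding the batches, and
    # return the wall-clock time in seconds.
    start = time.time()
    for _input, _target in loader:
        pass
    return time.time() - start

# If the second pass is much faster than the first, the
# filesystem's page cache is already doing the caching for you.
print("first pass:  %.1f s" % time_epoch(val_loader))
print("second pass: %.1f s" % time_epoch(val_loader))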

Thanks @eqy.

I want to save all the time spent on disk reads, decoding, and image preprocessing, so I cache the DataLoader’s final result (after preprocessing). It is for validation, so shuffle=False doesn’t matter.
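For reference, the overall pattern I’m aiming for is roughly this (a sketch; the CachedLoader name is just illustrative): populate a list of batches on the first pass, then iterate the cached list on later passes so the disk, decoder, and preprocessing are never touched again.

class CachedLoader:
    """Cache the batches produced by a DataLoader on the first pass,
    then serve the cached batches (by reference) on later passes."""

    def __init__(self, loader):
        self.loader = loader
        self.cached = None

    def __iter__(self):
        if self.cached is None:
            batches = []
            for batch in self.loader:
                batches.append(batch)  # stores references, not copies
                yield batch
            self.cached = batches
        else:
            yield from self.cached

val_loader = CachedLoader(val_loader)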

I have some other findings.

  1. There’s no data copy here; I verified it with data_ptr() (a minimal check is sketched at the end of this post).

  2. The cached.append time itself is very small, but it adds a lot to the total time, and I don’t know why:

import time

start = time.time()
cache_time = 0  # accumulated append time, in milliseconds
cached = []  # also tried queue.Queue()
for i, (_input, _target) in enumerate(val_loader):
    cache_start = time.time()
    cached.append((_input, _target))
    cache_time += (time.time() - cache_start) * 1000

total_time = time.time() - start  # in seconds

print(cache_time)
print(total_time)

For a dataset with 50,000 images, cache_time is only ~0.3 ms in total, but total_time increases from 15 seconds to 22 seconds after I add cached.append((_input, _target)) to the code.

  3. If I change pin_memory from True to False, all the timings are as expected: cache_time is still ~0.3 ms, and total_time stays around 15 seconds.
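If pinned memory still matters for the later host-to-GPU copies, one workaround I may try (an untested sketch; note that Tensor.pin_memory() does make a one-time copy into page-locked memory) is to load with pin_memory=False and pin each cached batch once after the loop:

# Load with pin_memory=False, then pin the cached tensors once.
# Tensor.pin_memory() returns a copy in page-locked memory, but this
# one-time copy happens after the timed loading loop.
cached = [(x.pin_memory(), y.pin_memory()) for (x, y) in cached]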

I know going from 15 s to 22 s is not a big problem, but I still want to understand why it happens. Thanks.
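For completeness, here is the minimal check I mentioned in finding 1: appending a tensor to a Python list stores a reference, so data_ptr() (the address of the tensor’s underlying storage) stays the same.

import torch

x = torch.randn(16, 3, 32, 32)
cached = []
cached.append(x)

# Same storage address: the list holds a reference, not a copy.
assert x.data_ptr() == cached[0].data_ptr()
print(hex(x.data_ptr()), hex(cached[0].data_ptr()))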