Slow data loading, perhaps caused by the CUDA cache?

My PyTorch program runs slowly. I located the bottleneck by timing individual lines:

from timeit import default_timer as timer
... (timed code here)

After running this code, I found the bottleneck:

    for batch_idx, (x1, x2, y) in enumerate(train_loader.get_augmented_iterator()):
        x1 = torch.Tensor(x1).to(device)  # slow in this line of code
        x1 = x1.transpose(1, 3)

Here get_augmented_iterator is a function I defined to load the data. The first line, x1 = torch.Tensor(x1).to(device), takes ~0.4 s to execute. get_augmented_iterator() itself takes about 0.1 s and includes some preprocessing steps I have to perform at this stage.

Suspecting the problem is not really in this line itself, I researched a little and found that if I add torch.cuda.empty_cache() before it, the line executes in normal time. However, torch.cuda.empty_cache() itself takes ~0.4 s, so this doesn't actually solve the problem, but it does suggest the cache is involved.

I tried several other projects that use very similar dataloaders, but I couldn't reproduce the problem with their code. So my question is: how could this relate to an error in my code, and how can I fix it?

Update: after several more runs, I found that if I comment out the loss.backward() line, the speed returns to normal. Since I obviously have to keep that line, how can I make training run at normal speed?

I believe the data must somehow be overflowing, forcing PyTorch to clear the cache at every iteration. But why and how would that cause this problem?

CUDA operations are executed asynchronously, so you need to synchronize manually before starting and stopping your timers via torch.cuda.synchronize().
Based on your description, you are "moving" the accumulated time from one operation to the next blocking one by commenting lines out: without synchronization, the next blocking operation (here the host-to-device copy) absorbs the time of all previously launched asynchronous operations, such as loss.backward().
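A minimal sketch of the correct timing pattern (tensor shapes and names here are illustrative, and the snippet falls back to CPU when CUDA is unavailable):

```python
import torch
from timeit import default_timer as timer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x1 = torch.randn(64, 32, 32, 3)  # stand-in for one batch from the loader

# Synchronize BEFORE starting the timer so pending async kernels
# (e.g. a previous loss.backward()) are not billed to this line.
if device.type == "cuda":
    torch.cuda.synchronize()
start = timer()

x1 = x1.to(device)       # the line being measured
x1 = x1.transpose(1, 3)  # NHWC -> NCHW

# Synchronize AFTER as well, so the timer stops only once
# the GPU has actually finished the work launched above.
if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = timer() - start
print(f"transfer + transpose: {elapsed:.4f}s")
```

With this bracketing, each timed region reports only its own cost, and the 0.4 s should reattach itself to the backward pass rather than to the data-loading line.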