Code runs faster on local GPU than cluster

My code runs much faster on my single GTX 1060 than on the cluster, which has 2 GTX 1080 Ti GPUs. This is strange because a few parts of the code run faster on the cluster, although most parts run slower. For comparison, enumerating through the dataloader takes ~9 seconds on my local machine but ~550 seconds on the cluster. Calculating the loss + backprop takes ~1 second on my local machine, while it takes ~10 seconds on the cluster. The code is from this paper: https://github.com/Philip-Bachman/amdim-public

How did you measure the timing in both applications?
If the data loading is ~60 times slower on the “cluster”, I would recommend narrowing down this issue first.
E.g. are you storing the data on a network drive or on a local SSD in both cases?
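
For example, something like this can isolate the data loading from the GPU work (a rough sketch; the dataset path, transform, and DataLoader settings are placeholders for whatever the AMDIM code actually uses):

```python
import time

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder path/settings; substitute the ones from the AMDIM training script.
dataset = datasets.ImageFolder("/path/to/data", transform=transforms.ToTensor())
loader = DataLoader(dataset, batch_size=128, num_workers=4, shuffle=True)

# Iterate once without touching the GPU, so only disk I/O, decoding,
# and augmentation are measured. Run this on both machines and compare.
start = time.time()
for images, labels in loader:
    pass
print(f"data loading only: {time.time() - start:.1f} s")
```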

I just called time.time() at different sections of the code and computed the differences for timing. Can you clarify how to narrow down the issue? The data is stored on my local SSD; on the cluster, I’m not exactly sure where on the network it’s stored. The original code used ImageFolder to load the data, and I tried changing it to a standard dataset, but this did not help.

If you are timing CUDA operations, you would have to synchronize the code via torch.cuda.synchronize() before starting and stopping the timer, due to the asynchronous execution of CUDA kernels.
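
In other words, something along these lines (a minimal sketch with a dummy model and batch standing in for the AMDIM code):

```python
import time

import torch
import torch.nn as nn

device = torch.device("cuda")

# Dummy stand-ins for the actual model and batch, just to show the pattern.
model = nn.Linear(1024, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
images = torch.randn(128, 1024, device=device)
targets = torch.randint(0, 10, (128,), device=device)

torch.cuda.synchronize()  # wait for all pending GPU work before starting the timer
start = time.time()

optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()

torch.cuda.synchronize()  # wait for the kernels launched above to finish
print(f"loss + backprop: {time.time() - start:.4f} s")
```

Without the synchronize calls, time.time() mostly measures how long it takes to queue the kernels, not how long they actually take to run.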

I don’t understand this explanation. Is the data stored on the SSD in your workstation or on a server in your network (or both)?
In the latter case you would introduce the network latency into the training, so I would recommend storing the data on a local SSD.

What do you mean by “standard Dataset”? Did you write a custom Dataset?
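
For context, a custom Dataset subclasses torch.utils.data.Dataset and implements __len__ and __getitem__, e.g. (a minimal hypothetical sketch, not the AMDIM implementation):

```python
import os

from PIL import Image
from torch.utils.data import Dataset


class FlatImageDataset(Dataset):
    # Hypothetical minimal example: loads all images from a single directory.
    def __init__(self, root, transform=None):
        self.paths = sorted(
            os.path.join(root, name)
            for name in os.listdir(root)
            if name.lower().endswith((".jpg", ".jpeg", ".png"))
        )
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img
```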

When I run on my local machine, I’m accessing the data on my local SSD. On the cluster, I copy all of the code/data over to storage on my Kubernetes pod and then run the code there, using the cluster’s GPUs. I think that in both cases the data is being read from a local SSD.
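
One way to sanity-check that assumption is to time raw file reads on both machines, independent of PyTorch (a rough sketch; /path/to/data is a placeholder for the dataset directory on each machine):

```python
import glob
import time

# Placeholder path; point this at the dataset directory on each machine.
paths = glob.glob("/path/to/data/**/*.jpg", recursive=True)[:2000]

start = time.time()
total_bytes = 0
for path in paths:
    with open(path, "rb") as f:
        total_bytes += len(f.read())
elapsed = time.time() - start

print(f"read {total_bytes / 1e6:.1f} MB from {len(paths)} files "
      f"in {elapsed:.1f} s ({total_bytes / 1e6 / elapsed:.1f} MB/s)")
```

If the pod’s storage is actually a network-backed volume rather than a local SSD, the throughput here will usually be far lower than on the workstation. Note that a second run can be misleading because of the OS page cache, so compare first runs.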