Why is data.cuda() so slow? And why does its speed depend on which model is being trained?

When I tried to train AlexNet, ModelNet, and ResNet, I found that moving the training data from CPU to GPU via data.cuda() is very slow. I also found that the speed of data.cuda() depends on which model is being trained. For example, when I trained AlexNet, data.cuda() was twice as fast as with ResNet. Does anyone know why?

How did you profile your code and come to this conclusion? It sounds wrong.

This is my code; the models I have used are from torchvision.models, including AlexNet, ResNet18, and ShuffleNet. I have also tried pin_memory in the DataLoader, but the problem still exists.


import time
import torch

# gpu, model, dataset and epoch are set up earlier in the script
torch.cuda.set_device(gpu)
model = model.cuda()

loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=256,
                                     num_workers=56, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
Loss = torch.nn.CrossEntropyLoss().cuda()

for ep in range(epoch):
    model.train()
    for batch, (x, y) in enumerate(loader):

        # time the host-to-device copy of the current batch
        cuda_start_time = time.time()
        x = x.cuda()
        y = y.cuda()
        cuda_gap = time.time() - cuda_start_time

        optimizer.zero_grad()
        output = model(x)
        train_loss = Loss(output, y)
        train_loss.backward()
        optimizer.step()

Thanks for sharing the code! CUDA operations are executed asynchronously, so you would need to synchronize the code via torch.cuda.synchronize() before starting and stopping the timers.
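
For example, a minimal sketch of timing just the copy (the tensor shape here is an arbitrary placeholder, not taken from your dataset):

import time
import torch

x = torch.randn(256, 3, 224, 224)   # placeholder CPU batch

torch.cuda.synchronize()             # wait for all previously queued GPU work
start = time.time()
x_gpu = x.cuda()                     # host-to-device copy
torch.cuda.synchronize()             # wait until the copy has actually finished
print(f"H2D copy took {(time.time() - start) * 1000:.2f} ms")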

Thanks for your reply. But I can't understand why its speed depends on which model is running. All data.cuda() does is transfer data from CPU to GPU, so why can the model influence its speed?

Because you are not measuring the actual kernel time of the data transfer, but "something else", due to the mentioned asynchronous execution.
If you don't synchronize the code, the CPU will just run ahead and can start and stop the timer while the data is still being transferred. Depending on the actual use case, which model is used, etc., you could hit an implicit synchronization point, which would then block the CPU and thus also increase the reported time.
In summary: your profiling is wrong and is not measuring the transfer time.

I generally recommend using e.g. torch.utils.benchmark for these types of unit tests, as it will add proper warmup iterations, will synchronize for you, etc.
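
E.g. a rough sketch (the input shape is again just a placeholder):

import torch
import torch.utils.benchmark as benchmark

x = torch.randn(256, 3, 224, 224)    # placeholder CPU batch

timer = benchmark.Timer(
    stmt="x.cuda()",
    globals={"x": x},
)
print(timer.timeit(100))              # adds warmup and synchronizes CUDA for you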

Thanks for your reply, I have learned a lot. I also have a question: when we just train the model normally, not running unit tests, we do not need to synchronize the code, but the end-to-end throughput may still be affected by this problem. Is there anything we can do to improve the end-to-end throughput? Thanks.

So far there is no problem, as you are not measuring the desired operations.
Your timer is influenced by the actual model execution and will therefore return wrong results.

Generally, the performance guide is a great resource for learning how to speed up training and inference.
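
One thing that often helps with the host-to-device copies specifically is pinned host memory combined with non_blocking=True transfers, so the copies can overlap with other work. A sketch based on your DataLoader (the rest of the training loop stays as before):

loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=256,
                                     num_workers=56, shuffle=True,
                                     pin_memory=True)

for batch, (x, y) in enumerate(loader):
    # copies from pinned memory can run asynchronously w.r.t. the host
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    ...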