Transferring data to GPU takes time and behaves weirdly

I am facing some weird timing issues while transferring data from CPU to GPU.

images1, labels1 = dataset.load_data(n_similar=args.num_similar)
print('1 ' + str(time.time()-t))

images = torch.from_numpy(images1).float().to(device)
print('2a ' + str(time.time()-t))
labels = torch.from_numpy(labels1).float().to(device)
print('2b ' + str(time.time()-t))

This prints the following:
1 0.18978023529052734
2a 0.3837716579437256
2b 0.38383913040161133

However, changing the order of the transfers (sending labels first and then images) does not change the time required, i.e., the following code

images1, labels1 = dataset.load_data(n_similar=args.num_similar)
print('1 ' + str(time.time()-t))

labels = torch.from_numpy(labels1).float().to(device)
print('2b ' + str(time.time()-t))
images = torch.from_numpy(images1).float().to(device)
print('2a ' + str(time.time()-t))

produces the following output:
1 0.1595933437347412
2b 0.3884453773498535
2a 0.3891716003417969

Moreover, the following code

images1, labels1 = dataset.load_data(n_similar=args.num_similar)
print('1 ' + str(time.time()-t))

images = torch.from_numpy(images1).float().to(device)
print('2a ' + str(time.time()-t))
labels = torch.from_numpy(labels1).float().to(device)
print('2b ' + str(time.time()-t))


images2 = torch.from_numpy(images1 + np.random.randn(10,3,224,224)).float().to(device)
print('2c ' + str(time.time()-t))
labels2 = torch.from_numpy(labels1*2).float().to(device)
print('2d ' + str(time.time()-t))

produces the following output:
1 0.14516925811767578
2a 0.38118934631347656
2b 0.3812577724456787
2c 0.46235036849975586
2d 0.4624507427215576

The above outputs are not just from the first iteration; I see similar timings across all iterations.
I used this to install PyTorch: conda install pytorch torchvision cuda80 -c soumith
Running on a Tesla K80.

Can anyone help me understand this and possibly reduce the time required to transfer data from the CPU to the GPU?

Since CUDA calls are asynchronous, you might just be measuring the time to launch the kernels.
Although I’m currently not sure whether the to() op runs asynchronously by default, you can rule this out by adding a synchronization before starting and stopping the timer:

torch.cuda.synchronize()
t0 = time.time()
...
torch.cuda.synchronize()
t1 = time.time()
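
Applied to your snippet, a minimal sketch (assuming device is a CUDA device and images1/labels1 are the NumPy arrays returned by load_data) would look like this:

torch.cuda.synchronize()  # make sure no previously queued GPU work is still pending
t = time.time()

images = torch.from_numpy(images1).float().to(device)
labels = torch.from_numpy(labels1).float().to(device)

torch.cuda.synchronize()  # wait until the copies have actually finished
print('transfer ' + str(time.time() - t))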

If you are using a DataLoader, you might set pin_memory=True to use pinned memory, which will speed up the transfer between the CPU and the GPU. Here is a good explanation of what’s going on under the hood.
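
For example, a rough sketch of that setup (the dataset, batch size, and worker count below are just placeholders):

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=10,
    num_workers=2,
    pin_memory=True,   # batches are collated into page-locked (pinned) host memory
)

for images, labels in loader:
    # with pinned memory, non_blocking=True lets the host-to-device copy
    # overlap with other CPU work
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)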
