It looks like the time to copy one batch of data from CPU to GPU varies with model size, or perhaps with inference time: the larger the model (or the longer the forward pass), the longer the copy appears to take. I can't understand why I am seeing this behavior.
Below is the code to reproduce it, along with the results:
import time
import torch
import torch.utils.data as data_utils
import torchvision.models as models

# 800 random 3x640x640 images with dummy labels
train_data = torch.randn(800, 3, 640, 640)
train_labels = torch.ones(800).long()
train = data_utils.TensorDataset(train_data, train_labels)
train_loader = data_utils.DataLoader(train, batch_size=2, shuffle=True)

# model = models.densenet161().cuda()  # OPTION 1
model = models.resnet18().cuda()  # OPTION 2

for x, y in train_loader:
    st = time.time()
    x, y = x.cuda(), y.cuda()  # time only the host-to-device copy
    print(time.time() - st)
    pred = model(x)
As you can see, the code prints the time taken to copy x and y to the GPU.
ResNet:
0.0025365352630615234
0.006475925445556641
0.009412765502929688
0.008988618850708008
0.009799957275390625
0.009394407272338867
0.009585857391357422
0.00932931900024414
0.009220361709594727
DenseNet:
0.046151161193847656
0.04446148872375488
0.041242122650146484
0.03512907028198242
0.03663516044616699
0.03776717185974121
0.03576469421386719
0.03654170036315918
0.03631234169006348
0.03648805618286133
The architectures don't even need to be different to reproduce this. For example, just run the forward pass multiple times on ResNet:
model = models.resnet18().cuda()

for x, y in train_loader:
    st = time.time()
    x, y = x.cuda(), y.cuda()
    print(time.time() - st)
    # four forward passes instead of one
    pred = model(x)
    pred = model(x)
    pred = model(x)
    pred = model(x)
Outputs:
0.03314852714538574
0.03551316261291504
0.03101205825805664
0.031397342681884766
0.03133535385131836
0.03079056739807129
0.025241851806640625
0.029177427291870117
0.029461145401000977
0.028394699096679688
So the measured copy time is more than 10× what it is with a single forward pass.
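My current guess is that the .cuda() copy is somehow waiting on GPU work queued by earlier iterations. If so, synchronizing explicitly around the copy should make the numbers flat again, regardless of the model. Here's a minimal sketch of what I mean, reusing the train_loader and model from above (torch.cuda.synchronize() blocks the host until all queued GPU work has finished; I haven't verified that this is the right way to isolate the copy):

for x, y in train_loader:
    torch.cuda.synchronize()   # drain any GPU work queued by earlier iterations
    st = time.time()
    x, y = x.cuda(), y.cuda()
    torch.cuda.synchronize()   # make sure the copy itself has completed
    print(time.time() - st)    # should now be the pure transfer time
    pred = model(x)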
On a different machine with a different GPU but the same PyTorch version (1.3.1), this behavior is not reproducible with batch_size=1, but it is reproducible with batch_size=2. The input size also seems to matter.
With
train_data = torch.randn(800, 3, 224, 224)
and batch_size=4, this behavior is not reproducible, but with batch_size=16 it is.
I'm not really sure what is happening here. Even if this is related to asynchronous execution, shouldn't a longer inference time ensure that the data has already been copied in the background?
The PyTorch version used is 1.3.1.
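In case it's relevant: the next thing I plan to try is pinned host memory with non-blocking copies, which as far as I understand is the standard way to let transfers overlap with compute. An untested sketch (pin_memory and non_blocking=True are both in the stable API; whether they actually change these numbers is exactly what I'm unsure about):

# pin_memory=True makes the DataLoader hand back page-locked host tensors,
# which is required for genuinely asynchronous host-to-device copies
train_loader = data_utils.DataLoader(train, batch_size=2, shuffle=True,
                                     pin_memory=True)

for x, y in train_loader:
    st = time.time()
    x = x.cuda(non_blocking=True)   # copy is only enqueued, not waited on
    y = y.cuda(non_blocking=True)
    print(time.time() - st)         # measures the enqueue, not the transfer
    pred = model(x)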