Maximizing CNN inference throughput on GPU

Hi, I am trying to maximize inference throughput on a GPU for my undergrad thesis analysis. In TensorFlow, I managed to “overlap” data loading and GPU computation. However, I can’t seem to do the same in PyTorch.

Essential code:

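import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

# some_file and FLAGS are defined elsewhere in the script.
my_net = torch.load(some_file)
my_net.cuda()

class MyDataset(Dataset):
    # Synthetic data: random images, so disk I/O is not a factor.
    def __len__(self):
        return 100000

    def __getitem__(self, idx):
        image = np.random.randn(3, 224, 224).astype(np.float32)
        return torch.from_numpy(image)

def run_inference(imgs):
    # torch.no_grad() replaces the older Variable(imgs, volatile=True).
    # non_blocking=True replaces async=True (async is a reserved word in Python 3.7+).
    with torch.no_grad():
        r = my_net(imgs.cuda(non_blocking=True))
    r.cpu()  # copy the result back to the host; this waits for the forward pass

my_dataset = MyDataset()
dataset_loader = DataLoader(my_dataset,
                            batch_size=FLAGS.batch_size,
                            shuffle=False,
                            num_workers=8,
                            pin_memory=True)

for i, data in enumerate(dataset_loader):
    run_inference(data)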

For instance, I’m running AlexNet inference with a batch size of 128 and getting 70 ms per batch, of which ~13 ms is spent moving data from host to GPU. Would it be possible to hide this cost by overlapping the data copy with GPU computation? How?
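
To put a number on what is at stake, here is a quick back-of-the-envelope calculation from the timings above. It is only a best-case estimate: it assumes the ~13 ms copy could be hidden completely behind compute and that the remaining 57 ms would stay unchanged, neither of which I have measured.

```python
# Best-case estimate from the timings quoted in this post (not a measurement).
batch_size = 128
total_ms = 70.0   # time per batch as measured
copy_ms = 13.0    # approximate host-to-GPU copy time within that

# Assumption: the copy is fully overlapped with GPU computation.
overlapped_ms = total_ms - copy_ms

current_throughput = batch_size / (total_ms / 1000.0)
best_case_throughput = batch_size / (overlapped_ms / 1000.0)
speedup = total_ms / overlapped_ms

print(f"current:   {current_throughput:.0f} images/s")
print(f"best case: {best_case_throughput:.0f} images/s")
print(f"speedup:   {speedup:.2f}x")
```

That works out to roughly 1829 images/s now versus about 2246 images/s in the best case, a speedup of about 1.23x.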

Thanks!