Hi, I am trying to maximize inference throughput on a GPU for my undergrad thesis analysis. In TensorFlow I managed to overlap data loading and GPU computation, but I can't seem to do the same in PyTorch.
import numpy as np
import torch
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader

# Load the trained network and move it to the GPU.
my_net = torch.load(some_file)
my_net.cuda()

class MyDataset(Dataset):
    """Stand-in dataset that generates random 3x224x224 float32 images."""
    def __len__(self):
        return 100000

    def __getitem__(self, idx):
        image = np.random.randn(3, 224, 224).astype(np.float32)
        return torch.from_numpy(image)

def run_inference(imgs):
    # volatile=True disables autograd bookkeeping for inference.
    batch = Variable(imgs, volatile=True)
    # async=True should let the host-to-device copy be asynchronous,
    # since the DataLoader returns pinned-memory tensors (pin_memory=True).
    r = my_net.forward(batch.cuda(async=True))
    r.cpu()

my_dataset = MyDataset()
dataset_loader = DataLoader(my_dataset,
                            batch_size=FLAGS.batch_size,
                            shuffle=False,
                            num_workers=8,
                            pin_memory=True)
dataset_iter = iter(dataset_loader)

for i, data in enumerate(dataset_loader):
    run_inference(data)
For instance, running AlexNet inference with a batch size of 128, I get about 70 ms per batch, of which roughly 13 ms is spent copying data from host to GPU. Is it possible to hide this cost by overlapping the data copies with the GPU computation? If so, how?
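To make the question concrete, here is a rough, untested sketch of the kind of overlap I have in mind, adapted to my snippet above: a side CUDA stream prefetches the next batch (copied with async=True from the pinned DataLoader tensors) while the current batch runs through the network on the default stream. The names prefetch_to_gpu and copy_stream are mine, and I'm not sure this is the right way to express it in PyTorch:

copy_stream = torch.cuda.Stream()

def prefetch_to_gpu(cpu_batch):
    # Enqueue the host-to-device copy on the side stream; the pinned-memory
    # source is what allows the copy to run asynchronously.
    with torch.cuda.stream(copy_stream):
        return cpu_batch.cuda(async=True)

loader_iter = iter(dataset_loader)
next_gpu_batch = prefetch_to_gpu(next(loader_iter))

for cpu_batch in loader_iter:
    # Make the default stream wait until the prefetched copy has finished.
    torch.cuda.current_stream().wait_stream(copy_stream)
    gpu_batch = next_gpu_batch
    # Start copying the following batch; ideally this overlaps with forward().
    next_gpu_batch = prefetch_to_gpu(cpu_batch)
    out = my_net.forward(Variable(gpu_batch, volatile=True)).cpu()

# Process the final prefetched batch.
torch.cuda.current_stream().wait_stream(copy_stream)
out = my_net.forward(Variable(next_gpu_batch, volatile=True)).cpu()

Is something along these lines the intended approach, or does the DataLoader with pin_memory=True already take care of this and I'm measuring something else?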