Non-blocking transfer to GPU is not working

I am having a problem getting .to(device) to work asynchronously. The training loop in the first code snippet below takes 3x longer than the one in the second snippet. The first snippet uses pin_memory=True, non_blocking=True, and num_workers=12. The second snippet moves the tensors to the GPU inside __getitem__ and uses num_workers=0. The images being loaded have shape [1, 512, 512], and the target is just a single float32.
Is there something I need to set in the CUDA drivers?
GPU: V100
PyTorch: 1.1.0
Python: 3.7.4
CUDA: 9.2 (Cuda compilation tools, release 9.2, V9.2.148)
conda version: 4.6.14
OS: Ubuntu 16.04.5

# This is very slow.
import torch
from torch.utils.data import Dataset, DataLoader

device = "cuda"

class MyDataset(Dataset):
    def __getitem__(self, idx):
        image = self.get_image_tensor(idx)   # CPU tensor of shape [1, 512, 512]
        target = self.get_target(idx)        # single float32
        return {"images": image, "targets": target}

train_dataset = MyDataset()
train_loader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
    num_workers=12,
    pin_memory=True)

def train():
    for batch in train_loader:
        images = batch["images"].to(device, non_blocking=True)
        targets = batch["targets"].to(device, non_blocking=True)

# This is faster, but still slower than it should be.
device = "cuda"

class MyDataset(Dataset):
    def __getitem__(self, idx):
        image = self.get_image_tensor(idx).to(device)
        target = self.get_target(idx).to(device)
        return {"images": image, "targets": target}

train_dataset = MyDataset()
train_loader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,
    pin_memory=False)

def train():
    for batch in train_loader:
        images = batch["images"]
        targets = batch["targets"]

Hi,
I think it's expected that there is some overhead when using multiple workers if your tensors are very cheap to load. The point of the workers is to load from disk ahead of time, to reduce the impact of slow hard drives.
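
A quick way to check which case you are in is to time the wait on the DataLoader separately from the device-side work. This is just a sketch (the profile_epoch helper is not from your code, and the forward/backward pass is left out):

import time
import torch

def profile_epoch(train_loader, device="cuda", max_batches=50):
    # Split each iteration into "waiting on the DataLoader workers"
    # vs. "GPU-side work" to see which one dominates.
    wait_on_loader = 0.0
    gpu_side = 0.0
    t_end = time.perf_counter()
    for i, batch in enumerate(train_loader):
        if i == max_batches:
            break
        wait_on_loader += time.perf_counter() - t_end
        t_step = time.perf_counter()
        images = batch["images"].to(device, non_blocking=True)
        targets = batch["targets"].to(device, non_blocking=True)
        # model forward/backward would go here
        torch.cuda.synchronize()  # otherwise the asynchronous copy is not counted
        gpu_side += time.perf_counter() - t_step
        t_end = time.perf_counter()
    print("waiting on loader: %.3fs, GPU-side: %.3fs" % (wait_on_loader, gpu_side))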

Why is the second one “slower than it should be”?

I meant to say that it would be much faster if the multi-worker version worked properly.
In the first snippet, GPU utilization is very low. The call to .to(device) only starts after the GPU has finished processing the current batch, which causes a lot of GPU idle time. The same implementation runs much faster in TensorFlow.
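
One workaround for this is to prefetch the next batch on a separate CUDA stream so its copy overlaps with the compute of the current batch. The sketch below is only an illustration (the CUDAPrefetcher class is not part of the original code and assumes pin_memory=True in the DataLoader):

import torch

class CUDAPrefetcher:
    """Issue the host-to-device copy of the next batch on a side stream
    while the main stream is still busy with the current batch."""
    def __init__(self, loader, device="cuda"):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            batch = next(self.loader)
        except StopIteration:
            self.next_batch = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking only has an effect if the CPU tensors are pinned
            self.next_batch = {k: v.to(self.device, non_blocking=True)
                               for k, v in batch.items()}

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        # block the current (compute) stream until the copy has finished
        torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        for v in batch.values():
            # tell the caching allocator the tensor is now used on the compute stream
            v.record_stream(torch.cuda.current_stream())
        self._preload()
        return batch

# usage: for batch in CUDAPrefetcher(train_loader): ...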

We had some issues using pinned memory recently (@rwightman reported it here), which have since been fixed, so you could try out the nightly build or build from source and check whether the profiling changes.
For general data loading bottlenecks, have a look at this post.
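
To check quickly, something like the following should show how much time is spent in the host-to-device copies relative to everything else, and whether that changes between builds (just a sketch using the train_loader from the snippets above; the forward/backward pass is omitted):

import torch

# Profile a handful of iterations with CUDA timing enabled.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i, batch in enumerate(train_loader):
        if i == 10:
            break
        images = batch["images"].to("cuda", non_blocking=True)
        targets = batch["targets"].to("cuda", non_blocking=True)
        # forward/backward pass omitted

# Sort the per-op summary by total CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total"))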