Why does PyTorch use significant CPU when using to(device) and DataLoader(num_workers=1)?

I’m puzzled to see my PyTorch script using 600% CPU (as reported by top) even though I’ve moved the tensors and the model to the GPU. Here’s what I’ve verified with .is_cuda (a rough sketch of the checks follows the list):

  • Input sequence tensors
  • Input sequence length tensors
  • Target tensors
  • PackedSequence tensors
  • Initialized hidden tensors
  • Output tensors
  • Output hidden tensors
  • Loss
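
The checks look roughly like this; the LSTM and random tensors below are dummy stand-ins for my actual model and batch, just to illustrate what I mean by checking .is_cuda:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

# Dummy stand-ins for my actual model and batch.
model = nn.LSTM(input_size=10, hidden_size=20).to(device)
inputs = torch.randn(5, 3, 10).to(device)
targets = torch.randn(5, 3, 20).to(device)

# Each of these prints True when the tensor lives on the GPU.
print(inputs.is_cuda, targets.is_cuda)

# nn.Module has no .is_cuda; check one of its parameters instead.
print(next(model.parameters()).is_cuda)
```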

Also, my data loaders use num_workers = 1, constructed roughly as shown below.
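
A sketch of the loader setup (the TensorDataset and batch size here are placeholders for my real dataset and settings):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for my real one, just to show the loader settings.
train_dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
train_loader = DataLoader(train_dataset, batch_size=32,
                          shuffle=True, num_workers=1)
```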

The GPU is being used (checked with nvidia-smi -l 3).

Am I missing something? Why is so much CPU still being used? I’m wondering whether my pipeline is working as intended.

There may be other things that become the bottleneck of the training process, such as disk I/O if the dataset is large, which can leave the GPU idle most of the time while the CPU stays busy.
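
One way to check is to time how long each iteration waits on the data loader versus how long the GPU spends computing. A minimal sketch, where train_loader and train_step stand in for your own loader and training step:

```python
import time
import torch

data_time = 0.0
compute_time = 0.0

end = time.time()
for batch in train_loader:           # your DataLoader
    data_time += time.time() - end   # time spent waiting for the next batch (CPU / disk)

    start = time.time()
    train_step(batch)                # your forward / backward / optimizer step
    torch.cuda.synchronize()         # CUDA calls are async; sync so the timing is accurate
    compute_time += time.time() - start

    end = time.time()

print(f"data loading: {data_time:.1f}s, GPU compute: {compute_time:.1f}s")
```

If the data-loading time dominates, the bottleneck is the CPU and disk rather than the GPU computation.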