How to solve the data transfer bottleneck from CPU to GPU?

Hi,

When I checked my GPU usage, I found that the GPU is often idle, which slows the simulation down.
In particular, the idle time comes from the data transfer from CPU to GPU:

  • videos = videos.cuda()
  • questions = questions.cuda()
  • answers = answers.cuda()

The main forward & backward computation takes around 0.55 seconds, but the data transfer takes 0.25 seconds…
I tried to solve this problem by calling cuda() inside a customized collate function:
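For what it's worth, here is roughly how I measured the transfer time (the synchronize() calls are there because GPU work can be queued asynchronously; the sketch falls back to a no-op on a CPU-only machine, and the tensor shapes are made up):

```python
import time
import torch

def time_transfer(t):
    # Rough wall-clock timing of a host-to-device copy.
    # Returns 0.0 on a CPU-only machine so the sketch still runs.
    if not torch.cuda.is_available():
        return 0.0
    torch.cuda.synchronize()          # drain any pending GPU work first
    start = time.perf_counter()
    t = t.cuda()
    torch.cuda.synchronize()          # wait until the copy actually finishes
    return time.perf_counter() - start

videos = torch.randn(8, 16, 3, 112, 112)  # dummy batch; shapes are invented
print(f"transfer: {time_transfer(videos):.3f} s")
```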

  def SeqCollate(batch):
      videos, questions, answers = zip(*batch)
      videos = torch.cat(videos, 0)
      questions = torch.cat(questions, 0)
      answers = torch.cat(answers, 0)
      return (videos.cuda(), questions.cuda(), answers.cuda())

But I got the following error:
“RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the ‘spawn’ start method”
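From what I understand, the error means the DataLoader worker processes must not call cuda() at all. If that is right, the collate function would have to stay entirely on the CPU, something like this (this is my guess at the fix, reusing the names from my code above):

```python
import torch

def SeqCollate(batch):
    # Do all the batching on the CPU; no cuda() calls here, so the
    # DataLoader workers never try to initialize CUDA in a forked process.
    videos, questions, answers = zip(*batch)
    videos = torch.cat(videos, 0)
    questions = torch.cat(questions, 0)
    answers = torch.cat(answers, 0)
    return videos, questions, answers
```

The transfer to the GPU would then happen in the main process, inside the training loop.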

Could you kindly let me know how I can solve the data transfer bottleneck from CPU to GPU?
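For reference, the recipe I keep seeing suggested is pinned host memory plus asynchronous copies. Would something like this be the right direction? A minimal sketch with an invented stand-in dataset (shapes and sizes are made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for the real video/question/answer tensors.
ds = TensorDataset(torch.randn(32, 3), torch.randn(32, 2), torch.randn(32, 1))

loader = DataLoader(
    ds,
    batch_size=8,
    num_workers=2,                          # workers never touch CUDA
    pin_memory=torch.cuda.is_available(),   # page-locked host buffers
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for videos, questions, answers in loader:
    # With pinned memory, non_blocking=True lets the host-to-device copy
    # overlap with computation on the GPU.
    videos = videos.to(device, non_blocking=True)
    questions = questions.to(device, non_blocking=True)
    answers = answers.to(device, non_blocking=True)
    # ... forward / backward would go here ...
```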

Thanks for reading my question!