Batch loading with DataParallel initially puts everything on one GPU

I have a deep learning model that I want to train on 2 GPUs. I understand that the first GPU may end up more utilized (see Debugging DataParallel, no speedup and uneven memory allocation), but in my case it looks like the entire batch initially lands on one GPU rather than just being distributed unevenly. I have a thread that preloads data onto the GPU, which looks like this:

def load_to_gpu(self, cpu_batches):
    for batch in cpu_batches:
        batch = GpuVariable(batch)  # GpuVariable calls _data_to_cuda(self.data) in __init__
        self.processed_queue.put(batch)

The main thread does:

def main(self):
    while True:
        self.model.train(self.processed_queue.get())

Under normal operation, the utilization looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0   128W / 149W |   4132MiB / 11439MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   73C    P0   122W / 149W |   1361MiB / 11439MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+

But when I don't dequeue from processed_queue and instead let the batches sit on the GPU, I see this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |   
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 | 
| N/A   73C    P0    74W / 149W |   6003MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 | 
| N/A   51C    P8    29W / 149W |     11MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
    
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |   
|=============================================================================|
+-----------------------------------------------------------------------------+

This suggests to me that the whole batch is placed on GPU 0, and only once I start doing compute-heavy work does half of it get transferred to the other GPU. What I want is for the batch to be split across the GPUs during preloading, so that I don't pay the transfer overhead in my forward pass. Is there a way to do that, given how DataParallel works?
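
To make it concrete, something like the sketch below is what I have in mind. It is only a rough outline, not tested code: device_ids and train_step are placeholders for my own setup, and I am assuming the lower-level torch.nn.parallel primitives (replicate / parallel_apply / gather) plus torch.cuda.comm.scatter can stand in for DataParallel's forward, so that the per-device copies happen in the loader thread instead of inside forward():

import torch
import torch.cuda.comm
from torch.autograd import Variable
from torch.nn.parallel import replicate, parallel_apply, gather

device_ids = [0, 1]

def load_to_gpu(self, cpu_batches):
    for batch in cpu_batches:
        # Split the batch along dim 0 and copy one chunk to each GPU here,
        # in the loader thread, rather than inside the forward pass
        chunks = torch.cuda.comm.scatter(batch, device_ids)
        self.processed_queue.put([Variable(c) for c in chunks])

def train_step(self, chunks):
    # Mirrors what DataParallel does on each forward call, minus the input
    # scatter: replicate the model (a plain module living on GPU 0) onto both
    # devices, run each replica on its pre-loaded chunk, and gather the
    # outputs back to GPU 0
    replicas = replicate(self.model, device_ids)
    outputs = parallel_apply(replicas, [(c,) for c in chunks])
    return gather(outputs, target_device=device_ids[0])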

As a side note, if anyone has other ideas about why utilization is below 95%, that would be appreciated. My GPU utilization is under 80% even on just 1 GPU.

Could you please provide more information on how you define your model, and on what this GpuVariable() method is? I'm not aware of it.

I would recommend using this bit of code to move a torch tensor to GPU memory:

batch = Variable(batch.cuda())
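
For example, in context (just a minimal sketch; MyModel, loader, and the device ids are placeholders for your own model, data loader, and hardware):

import torch.nn as nn
from torch.autograd import Variable

# Wrap the model once; DataParallel scatters whatever batch you pass to
# forward() across device_ids on every call
model = nn.DataParallel(MyModel(), device_ids=[0, 1]).cuda()

for batch in loader:
    batch = Variable(batch.cuda())   # host -> GPU 0 copy
    output = model(batch)            # split across GPUs 0 and 1 happens here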