I have a deep learning model that I want to train on two GPUs. I understand that the first GPU may end up more heavily utilized (see Debugging DataParallel, no speedup and uneven memory allocation), but I suspect the batch is already being distributed unevenly at the preloading stage. I have a thread that preloads data onto the GPU and looks like this:
def load_to_gpu(self, cpu_batches):
    for batch in cpu_batches:
        batch = GpuVariable(batch)  # GpuVariable calls _data_to_cuda(self.data) in __init__
        self.processed_queue.put(batch)
The main thread consumes from the queue:
def main(self):
    while True:
        self.model.train(self.processed_queue.get())
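For context, the model itself is wrapped in DataParallel over both GPUs, roughly like this (Net is just a placeholder for my actual module):

import torch.nn as nn

# Replicate the module across both devices; DataParallel scatters the
# input batch across the GPUs at forward() time.
self.model = nn.DataParallel(Net().cuda(), device_ids=[0, 1])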
Under normal operation, the utilization looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                 Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0   128W / 149W |   4132MiB / 11439MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   73C    P0   122W / 149W |   1361MiB / 11439MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+
But when I don't dequeue from processed_queue and just let the preloaded batches sit on the GPU, I see this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                 Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    74W / 149W |   6003MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                    0 |
| N/A   51C    P8    29W / 149W |     11MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
This suggests to me that the whole batch is put on GPU 0, and only once the compute-heavy work starts is half of the batch transferred to the other GPU. What I want is for the batch to be split across both GPUs during preloading, so I don't pay the transfer overhead in my forward pass. Is there a way to do that, given how DataParallel works?
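What I have in mind is something like this in the loader thread (just a sketch; devices is a hypothetical argument holding the two GPU ids, and I'm not sure DataParallel can consume a batch that has already been scattered like this):

def load_to_gpu(self, cpu_batches, devices=(0, 1)):
    for batch in cpu_batches:
        # Split along the batch dimension and copy each chunk to its own GPU
        # up front, so the forward pass doesn't pay the host-to-device cost.
        chunks = batch.chunk(len(devices), dim=0)
        gpu_chunks = [chunk.cuda(device) for chunk, device in zip(chunks, devices)]
        self.processed_queue.put(gpu_chunks)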
As a side note, if anyone has other ideas about why utilization is below 95%, that would be appreciated. My GPU utilization is under 80% even when running on just one GPU.