Performance degradation when GPU I/O and compute run in parallel

I am using DistributedDataParallel for multi-GPU training in PyTorch, with one process per GPU.
When I follow the usual pattern in PyTorch, the inference time looks fine:

x, y = next(train_loader)        # train_loader here is an iterator over the DataLoader
x = x.cuda(rank)                 # move the batch to this process's GPU
y = y.cuda(rank)
t0 = time.time()
y1 = model(x)
torch.cuda.synchronize()         # wait for the forward pass to finish before stopping the timer
inference_time = time.time() - t0

But when I fetch the data from another thread, which continuously reads batches from train_loader and puts them into a queue, the timing changes. The code is as follows.

args.data_queue = queue.Queue()

def load_data_queue(rank, dataloader, args):
    # Background thread: read batches, move them to this process's GPU,
    # and hand them to the training loop through the queue.
    while True:
        try:
            x, y = next(dataloader)
            x = x.cuda(rank)
            y = y.cuda(rank)
            args.data_queue.put([x, y])
        except StopIteration:
            print('load queue quits normally')
            return
...
t = threading.Thread(target=load_data_queue, args=(
        rank, train_loader, args), daemon=True)
t.start()
...
x, y = args.data_queue.get()     # the batch was already moved to the GPU by the loader thread
t0 = time.time()
y1 = model(x)
#torch.cuda.synchronize()
inference_time = time.time() - t0

The inference_time increases a lot.
To my understanding, GPU I/O should not affect GPU compute. What is causing this?

You might be running into the Python GIL.
Wouldn't the first DDP approach work, or what shortcomings are you facing that require the second approach?
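
If you want to check whether the GPU work itself got slower, or the host thread is just being stalled (e.g. by the GIL) around the kernel launches, you could also time the forward pass with CUDA events. A rough sketch, where model and x are the objects from your snippets:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y1 = model(x)
end.record()
torch.cuda.synchronize()                  # make sure both events have completed
gpu_time_ms = start.elapsed_time(end)     # GPU-side time between the two events, in ms

Comparing this number with the time.time() measurement gives one more data point for separating host-side stalls from the actual GPU execution time.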

Thank you for your reply. We are doing some research that requires GPU I/O and compute to run in parallel, and we observed this problem while validating our idea.
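
Concretely, the kind of overlap we are after looks roughly like this in a single process. This is just a sketch: copy_stream and prefetch_to_gpu are illustrative names, and it assumes the DataLoader uses pin_memory=True and that torch.cuda.set_device(rank) was called for this process.

copy_stream = torch.cuda.Stream()

def prefetch_to_gpu(x_cpu, y_cpu, rank):
    # Copy the next batch on a side stream while the default stream keeps
    # computing on the current batch. The host tensors must be pinned for
    # the non_blocking copies to actually be asynchronous.
    with torch.cuda.stream(copy_stream):
        x = x_cpu.cuda(rank, non_blocking=True)
        y = y_cpu.cuda(rank, non_blocking=True)
    return x, y

# Before the forward pass consumes x, y:
# torch.cuda.current_stream().wait_stream(copy_stream)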

Thanks! I changed the thread-based parallelism to process-based parallelism, and the training performance is back to normal.
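
For anyone hitting the same issue, one way to set up the process-based loader looks like this. It is a simplified sketch, not our exact code: load_data_queue_proc is just an illustrative name, and the producer keeps batches on the CPU so that no CUDA tensors have to be shared across processes.

import torch.multiprocessing as mp

def load_data_queue_proc(dataloader, data_queue):
    # Producer process: read batches and push CPU tensors into the queue;
    # the training process does the .cuda(rank) copy itself.
    for x, y in dataloader:
        data_queue.put((x, y))
    data_queue.put(None)                  # sentinel: no more batches

# In the training process for this rank:
# data_queue = mp.Queue(maxsize=4)
# p = mp.Process(target=load_data_queue_proc,
#                args=(train_loader, data_queue), daemon=True)
# p.start()
# item = data_queue.get()
# if item is not None:
#     x, y = item
#     x = x.cuda(rank)
#     y = y.cuda(rank)

Since the producer process never shares an interpreter with the training process, its data loading no longer contends for the training process's GIL, which is consistent with the training time going back to normal.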