Performance degradation when GPU I/O and compute run in parallel

I am using DistributedDataParallel for multi-GPU training in PyTorch, with one process per GPU.
When I follow the usual pattern in PyTorch, the inference time looks fine:

x, y = next(train_loader)        # train_loader here is an iterator over the DataLoader
x = x.cuda(rank)                 # move the batch to this process's GPU
y = y.cuda(rank)
t0 = time.time()
y1 = model(x)
torch.cuda.synchronize()         # wait for the forward pass to finish before stopping the timer
inference_time = time.time() - t0

But when I fetch the data from another thread, which continuously reads batches from train_loader and puts them into a queue, the timing changes. The code is as follows.

args.data_queue = queue.Queue()

def load_data_queue(rank, dataloader, args):
    # Background thread: read batches, move them to this process's GPU,
    # and hand them to the training loop through the queue.
    while True:
        try:
            x, y = next(dataloader)
            x = x.cuda(rank)
            y = y.cuda(rank)
            args.data_queue.put([x, y])
        except StopIteration:
            print('load queue quits normally')
            return
...
t = threading.Thread(target=load_data_queue, args=(
        rank, train_loader, args), daemon=True)
t.start()
...
x, y = args.data_queue.get()     # the batch was already moved to the GPU by the loader thread
t0 = time.time()
y1 = model(x)
#torch.cuda.synchronize()
inference_time = time.time() - t0

The inference_time increases a lot.
To my understanding, GPU I/O should not affect GPU compute. What is causing this?

You might be running into the Python GIL.
Wouldn't the first DDP approach work, or what shortcomings are you facing that require the second approach?
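
If you want to check whether the GPU work itself got slower, or the host thread is just being stalled (e.g. by the GIL) around the kernel launches, you could also time the forward pass with CUDA events. A rough sketch, where model and x are the objects from your snippets:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y1 = model(x)
end.record()
torch.cuda.synchronize()                  # make sure both events have completed
gpu_time_ms = start.elapsed_time(end)     # GPU-side time between the two events, in ms

Comparing this number with the time.time() measurement gives one more data point for separating host-side stalls from the actual GPU execution time.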

Thank you for your reply. We are doing some research that requires GPU I/O and compute to run in parallel, and we observed this problem while validating our idea.
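
Concretely, the kind of overlap we are after looks roughly like this in a single process. This is just a sketch: copy_stream and prefetch_to_gpu are illustrative names, and it assumes the DataLoader uses pin_memory=True and that torch.cuda.set_device(rank) was called for this process.

copy_stream = torch.cuda.Stream()

def prefetch_to_gpu(x_cpu, y_cpu, rank):
    # Copy the next batch on a side stream while the default stream keeps
    # computing on the current batch. The host tensors must be pinned for
    # the non_blocking copies to actually be asynchronous.
    with torch.cuda.stream(copy_stream):
        x = x_cpu.cuda(rank, non_blocking=True)
        y = y_cpu.cuda(rank, non_blocking=True)
    return x, y

# Before the forward pass consumes x, y:
# torch.cuda.current_stream().wait_stream(copy_stream)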

Thanks! I changed the thread-based parallelism to process-based parallelism, and the training performance is back to normal.
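
For anyone hitting the same issue, one way to set up the process-based loader looks like this. It is a simplified sketch, not our exact code: load_data_queue_proc is just an illustrative name, and the producer keeps batches on the CPU so that no CUDA tensors have to be shared across processes.

import torch.multiprocessing as mp

def load_data_queue_proc(dataloader, data_queue):
    # Producer process: read batches and push CPU tensors into the queue;
    # the training process does the .cuda(rank) copy itself.
    for x, y in dataloader:
        data_queue.put((x, y))
    data_queue.put(None)                  # sentinel: no more batches

# In the training process for this rank:
# data_queue = mp.Queue(maxsize=4)
# p = mp.Process(target=load_data_queue_proc,
#                args=(train_loader, data_queue), daemon=True)
# p.start()
# item = data_queue.get()
# if item is not None:
#     x, y = item
#     x = x.cuda(rank)
#     y = y.cuda(rank)

Since the producer process never shares an interpreter with the training process, its data loading no longer contends for the training process's GIL, which is consistent with the training time going back to normal.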