Issue with DataParallel model and multithreaded data loading

I am using a DataParallel model and multithreading for loading the input and ground truth. I use a Python queue to synchronize between the main thread and the loader threads. The code looks like the following pseudocode.

import queue
import threading as T
import torch as th
from torch.autograd import Variable

# model, cuda_devices, dirs, bs, no_of_batches, current_batch_size and the
# progress bar `bar` are defined elsewhere.
Q_dirs = queue.Queue()                # directory lists waiting to be loaded
Q_tensors = queue.Queue(maxsize=4)    # loaded batches, at most 4 ahead of the GPU

model = th.nn.DataParallel(model, device_ids=cuda_devices, output_device=cuda_devices[0]).cuda()

def loader():
	while True:
		list_dirs = Q_dirs.get()
		tensor_data = {}
		# for each dir in list_dirs: read the tensors and store them in tensor_data
		# tensor_data['features'] = features
		# tensor_data['q_gt'] = q_gt
		Q_tensors.put(tensor_data)

# start the loader threads before the main loop
loaders = []
for i in range(4):
	loaders.append(T.Thread(name='L' + str(i), target=loader, daemon=True))
for l in loaders:
	l.start()

# in the main thread: enqueue the directory list of every batch
for i in range(no_of_batches):
	list_dirs = []
	for j in range(current_batch_size):
		list_dirs.append(dirs[i * bs + j])
	Q_dirs.put(list_dirs)

# training loop: consume pre-loaded batches and run the forward pass
for i in bar(range(no_of_batches)):
	tensor_data = Q_tensors.get()
	features = tensor_data['features']
	q_gt = tensor_data['q_gt']
	input = Variable(features, requires_grad=False)
	q_gt = Variable(q_gt, requires_grad=False)
	q_pred = model(input)

But the DataParallel model hangs in the forward pass. At the point where it hangs, I can see that only one of the GPU cards (the card I set as output_device when creating the DataParallel) is at 100% utilization while all the other cards are at 0%. This happens at various epochs and is not consistent. I suspect a race condition in my code, but given that I am only using thread-safe Python 3.5 queues for synchronization, I am puzzled about the issue. Could someone throw some light on what is going on here?

Some additional information about my system.

OS: Ubuntu 16.04.2 LTS
python: 3.5
CUDA: 8.0
PyTorch version: 0.1.11_4

  1. Upgrade PyTorch to the latest version.
  2. When using CUDA with multiprocessing, you have to use the spawn or forkserver start method, as mentioned in http://pytorch.org/docs/master/notes/multiprocessing.html#sharing-cuda-tensors

I think (2) is the reason for your deadlock.
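
For (2), the start method has to be selected once at the top of the program, before any CUDA work; roughly like this (train_worker is just a placeholder for your own entry point):

import torch.multiprocessing as mp

def train_worker(rank):
	# placeholder: do all CUDA work (model, tensors) inside the child process
	pass

if __name__ == '__main__':
	mp.set_start_method('spawn')   # or 'forkserver'; must run before CUDA is initialized
	p = mp.Process(target=train_worker, args=(0,))
	p.start()
	p.join()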

@smth Thanks for your suggestions. I am upgrading PyTorch to the latest version and will try the code again. But just to clarify, I am not using multiprocessing; I am using multithreading. I have 4 threads (I updated the pseudocode I posted to reflect this) that load the next batches while the network is processing the current batch, so that the network does not have to wait for the data to be loaded. And, as I said, I am using queue.Queue, which is thread-safe, to synchronize between the loader threads and the main thread (the network thread).
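
As a self-contained illustration of that hand-off (a toy sketch with dummy work items, not my training code), queue.Queue does all of the locking internally, and the bounded result queue keeps the producers from running too far ahead:

import queue
import threading

work_q = queue.Queue()               # filled by the main thread
result_q = queue.Queue(maxsize=4)    # bounded, so workers stay at most 4 items ahead

def worker():
	while True:
		item = work_q.get()          # blocks until work is available
		result_q.put(item * 2)       # blocks while the result queue is full

for i in range(4):
	threading.Thread(target=worker, daemon=True).start()

for i in range(10):
	work_q.put(i)

for i in range(10):
	print(result_q.get())            # blocks until a worker has produced a result
	                                 # (results may arrive out of order with several workers)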

@smth After updating PyTorch to the latest version, the issue has not appeared again. But I am not very confident that it will never show up again. I will use the network frequently over the next days and will update this thread if I see the issue again. Thanks for your suggestions :slight_smile:

@smth I managed to pinpoint the actual cause of the issue. It was a silly mistake in my code: in the loader thread, I was loading the ground truth as a CUDA tensor. This caused the deadlock when the main thread was holding the lock on the GPUs while training. It is fixed now.
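
For anyone hitting the same thing, the change amounts to keeping everything on the CPU inside the loader threads and only moving tensors to the GPU from the main thread; sketched against the pseudocode above:

# in loader(): keep the tensors on the CPU
tensor_data['q_gt'] = q_gt    # CPU tensor, not q_gt.cuda()

# in the main thread, after Q_tensors.get(): move to the GPU here
q_gt = Variable(tensor_data['q_gt'].cuda(), requires_grad=False)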
