I am facing a thread deadlock issue when I use multiple GPUs with DataParallel(). The model is training on a medium-sized dataset with 240K training samples. It trains successfully for one epoch. In the second epoch, training progresses smoothly until about the 50% mark, after which it simply hangs with no further progress. When I try to kill the process with Ctrl+C or kill -s SIGKILL, it becomes a zombie process!
Here is the traceback I get when I press Ctrl+C; the main process is stuck in parallel_apply waiting on thread.join():
File "run_E2E_EL_RE.py", line 962, in <module>
main()
File "run_E2E_EL_RE.py", line 913, in main
global_step, tr_loss = train(args, model, tokenizer)
File "run_E2E_EL_RE.py", line 249, in train
el_loss, re_loss, _, _ = model.forward(**ned_inputs)
File "/dresden/users/rb897/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/dresden/users/rb897/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/dresden/users/rb897/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
thread.join()
File "/dresden/users/rb897/anaconda3/lib/python3.7/threading.py", line 1044, in join
self._wait_for_tstate_lock()
File "/dresden/users/rb897/anaconda3/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
Code Snippet:
model.zero_grad()
train_iterator = trange(
    epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
)
set_seed(args)  # Added here for reproducibility
for epoch_num in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
    for step, batch in enumerate(epoch_iterator):
        model.train()
        batch = tuple(t.to(args.device) for t in batch)
        inputs_1 = {...}
        inputs_2 = {...}
        loss_1, last_hidden_states = model.forward(**inputs_1)
        inputs_2["last_hidden_states"] = last_hidden_states
        loss_2, loss_3 = model.forward(**inputs_2)
        if args.n_gpu > 1:  # mean() to average on multi-GPU parallel training
            loss_1 = loss_1.mean()
            loss_2 = loss_2.mean()
            loss_3 = loss_3.mean()
        loss = loss_1 + loss_2 + loss_3
        loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
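For completeness, the model is already wrapped in DataParallel before this loop starts (the wrapping itself is not in the snippet above). It looks roughly like the sketch below; the exact arguments are placeholders rather than copied from run_E2E_EL_RE.py:

import torch

# Rough sketch of the multi-GPU setup that precedes the training loop.
# Argument values (device_ids etc.) are placeholders, not my actual code.
model.to(args.device)
if args.n_gpu > 1:
    # DataParallel.forward() scatters each batch across the GPUs and runs
    # one thread per replica in parallel_apply, which is where the
    # traceback above shows the hang (thread.join never returns).
    model = torch.nn.DataParallel(model)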
Environment:
Per-GPU batch size: 4
OS: Ubuntu 18.04.5
CPU: Intel Xeon, 64 cores
CUDA version: 11.2
GPU: Quadro RTX 6000
GPU memory: 24 GB
PyTorch: 1.4.0
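In case it helps with diagnosing the hang, I can dump the Python stacks of all threads while the process is stuck using only the standard library. This is a minimal sketch (the signal choice and timeout are arbitrary, and this is not yet part of run_E2E_EL_RE.py):

import faulthandler
import signal

# Dump every thread's Python stack to stderr every 30 minutes while the
# process is alive, and also on demand via `kill -USR1 <pid>`.
faulthandler.dump_traceback_later(30 * 60, repeat=True)
faulthandler.register(signal.SIGUSR1, all_threads=True)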