F.conv2d stuck on my CentOS

Baron_Tsai · June 16, 2018, 7:41am

My PyTorch code runs fine on macOS and even on Windows, but the same code appears to hang on CentOS 6.3.

I debugged with ipdb and found that the code was stuck in the F.conv2d function. The process looks like this while it hangs:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19285 work 20 0 2106m 906m 22m S 0.3 0.6 0:31.14 python
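The S state and near-zero %CPU above suggest the process is blocked and waiting rather than spinning. One way to see where each thread is waiting, without stepping through in ipdb, is a watchdog that dumps all Python stacks if a call takes too long. This is only a sketch: `run_with_watchdog` and its parameters are hypothetical names for illustration, and `faulthandler` is in the standard library only from Python 3.3 on, so it applies to the 3.6 environment, not the 2.7 one.

```python
import faulthandler
import sys


def run_with_watchdog(fn, timeout_s=30.0, stream=None):
    """Call fn(); if it has not returned after timeout_s seconds,
    dump the Python traceback of every thread to `stream`.

    `stream` must be a real file with a file descriptor
    (defaults to sys.stderr). The process is not killed.
    """
    if stream is None:
        stream = sys.stderr
    # Arm the watchdog: the dump happens from a background thread,
    # so it still fires when the main thread is stuck in native code.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=stream)
    try:
        return fn()
    finally:
        # Disarm if fn() returned (or raised) before the timeout.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the suspect call, for example `run_with_watchdog(lambda: model(input))`, would print the Python frames leading into the hang. It only shows Python-level frames, so a deadlock inside a native library still needs a tool such as gdb to inspect further.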

The environment was created with Anaconda (Python 2.7/3.6), and the PyTorch version is 0.4.0.

I have spent a long time trying to resolve this problem without success. Do you have any suggestions? Thank you so much!

ptrblck · June 16, 2018, 5:15pm

Are you running your code on CPU or GPU, and are you using multiprocessing?

Baron_Tsai · June 17, 2018, 1:28am

 for ii, (data, label) in tqdm(enumerate(train_dataloader)):
     input = Variable(data)
     target = Variable(label)
     optimizer.zero_grad()
     score = model(input) # stuck here 
     loss = criterion(score, target)
     loss.backward()
     optimizer.step()

On CPU, with no multiprocessing, I think…

Baron_Tsai · June 22, 2018, 2:19am

I reinstalled CentOS 6.3 and then upgraded glibc to 2.14 and then to 2.17, following the error messages PyTorch 0.4.0 gave at runtime.

Now everything is ok.

By the way, PyTorch 0.3.1 worked fine before I upgraded glibc, while it was still at 2.12. So I think the latest PyTorch 0.4.0 may not handle an older glibc well: it deadlocks without printing any error or warning and just hangs at F.conv2d in torch/nn/modules/conv.py(301).
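For anyone hitting the same hang, a quick check of the glibc version the Python process sees might look like the sketch below. The helper names are hypothetical, and the 2.17 threshold is simply the version that worked in this thread, not a documented PyTorch requirement.

```python
import platform


def version_tuple(version):
    """Turn '2.17' into (2, 17) so versions compare numerically
    (as plain strings, '2.9' would wrongly sort above '2.17')."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def meets_minimum(version, minimum="2.17"):
    """True if `version` is at least `minimum`.
    The default of 2.17 is an assumption taken from this thread."""
    return version_tuple(version) >= version_tuple(minimum)


def report_glibc():
    """Print the C library that Python detects for this interpreter."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        print("glibc not detected (lib=%r, version=%r)" % (lib, version))
    elif meets_minimum(version):
        print("glibc %s: at or above 2.17" % version)
    else:
        print("glibc %s: older than 2.17" % version)


if __name__ == "__main__":
    report_glibc()
```

On a stock CentOS 6 system this would be expected to report the older branch, since that release ships glibc 2.12.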

Thank you all the same.