PyTorch stops without any warning

Hi,
When I use PyTorch with an IPython notebook remotely on a cluster, using the CPU, I find that, whatever the model is (VGG or a simple 2-layer CNN), sometimes the code just stops: the IPython notebook shows it’s still running, but the “top” command shows nothing is running.

There is no warning at all, and I don’t think it’s because of a memory leak (I only used about 10% of the memory).

(Although when memory does leak, the code also stops running with no warning, and the IPython notebook shows it’s still running.)

So what can I do? I mean, at least there should be some information saying why the code stopped.

Tell me if you need any information.

Do you get any error messages in the terminal where the jupyter notebook server was started?

Hi,

No. But I am using "screen", so I can’t scroll back.

What do you mean?
Do you start the notebook server on a remote machine and detach from the session using screen?
Can you reconnect to the session using screen -r?

Yeah, I reconnected to the session, but I can’t scroll back in the terminal, so I can’t see the earlier output.

As far as I could see, there was no error information.

Ok, then it’s kind of tricky.
Do you have a small script where the notebook crashes? Or does it crash randomly?
If it’s random, could you try to update jupyter etc.?

screen seems to support scrolling in some way: https://unix.stackexchange.com/a/40243

Disclaimer: I haven’t read or tried the solution in the linked answer.

I just run the default VGG16 from PyTorch, nothing special.

And it’s random, because it works well most of the time.

I will update that. Thanks.

Ok, let me know if it helps, since without a proper error message it could be the notebook server or another application on the server. Also, could you try @SimonW’s suggestion, just to exclude possible error sources?

Thanks! And thank you too, @SimonW.

I will try to provide more information when I run into this problem again.

The error occurred again (sorry, I haven’t updated the IPython notebook yet because it’s running).

There is no error information at all. Here is the “top” output:

PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20 0 17.009g 0.012t 124964 R 2000 19.0 8580:31 python
20 0 12.678g 9.038g 128752 S 0.0 14.4 76:30.30 python

It seems that the code is still running, but it uses 0% CPU. (The machine has enough CPU and memory to run it.)
(The first line is the same code but with different weights; the first runs well, and the second one seems stuck.)

Ok, in order to exclude potential code errors, could you export the Python code from that notebook and run it in a terminal?
If it hangs again, we should have a look at deadlocks etc.; if not, it’s likely some IPython/notebook issue.
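If it does hang, Python’s built-in faulthandler module can at least show where it is stuck (it’s in the standard library on Python 3, and available as a backport package on PyPI for Python 2). A minimal sketch:

```python
import faulthandler
import signal
import sys

# Dump the stack of every thread to stderr every 60 seconds; if the
# process deadlocks, the last dump shows where each thread is stuck.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# Also allow an on-demand dump from another shell via: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1)

# ... the rest of the training script runs as usual ...
```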

Hope the following is helpful:

I am using a small cluster (40 threads). This problem is not related to the IPython notebook, because I tried running the code without it. In general the code runs for several batches (fewer than 10) and then stops without any warning.

And this problem disappears after I restart the cluster.

I guess the problem may come from either running nn.DataParallel too many times or running pathos.multiprocessing too many times. BTW, I am using Python 2.

Lastly, this problem is not related to the model; at least, I tried different models.
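To isolate the DataParallel/multiprocessing suspicion, one thing I might try is a single-process baseline with no nn.DataParallel and no worker processes; if that never hangs, the problem is probably in the parallel parts. A rough sketch (MyModel and dataset are placeholders for whatever is actually used):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = MyModel()             # placeholder: the actual model goes here
loader = DataLoader(dataset,  # `dataset` is assumed to be defined elsewhere
                    batch_size=32,
                    num_workers=0)  # 0 = load data in the main process

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.train()
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```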

For how long does this problem disappear?
Do you see any patterns when it happens?

Still, you don’t get a proper error? The script just hangs and that’s it?

Could you create a conda environment with Python 3.6 and try it again?

> For how long does this problem disappear?

It disappears after I restart the cluster.

> Do you see any patterns when it happens?

No.

> Still, you don’t get a proper error? The script just hangs and that’s it?

No error. The script still holds the memory but uses 0% CPU.

> Could you create a conda environment with Python 3.6 and try it again?

Ok.

For me it happens when using Python 3.6 with 2 or 4 RTX 2080 GPUs.
The script can either work fine for many epochs or stop during the first one.

It has happened to me before too. I tested the code on an older version of Torch and got an error instead. My problem was related to passing a torch Tensor to a function written to take normal Python numeric values.

I have since cast the tensors to int before passing them to the function, and so far it hasn’t failed without warning.
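For reference, the failure mode looked roughly like this (make_range is a hypothetical stand-in for the function that expected a plain Python int):

```python
import torch

def make_range(n):
    # A plain-Python helper that expects `n` to be an int.
    return list(range(n))

t = torch.tensor(5)

# Passing the tensor directly relies on implicit conversion, which can
# error out or misbehave depending on the PyTorch version:
# make_range(t)

# Casting explicitly removes the ambiguity:
make_range(int(t.item()))
```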

When it happened, it happened whether the GPU or the CPU was used. It was happening on the latest stable version of PyTorch 1.0 with CUDA 10.0, on both Linux (Docker) and Windows.