PyTorch stops without any warning

Hi,
When I use PyTorch with an IPython notebook remotely on a cluster, using the CPU, I find that, whatever the model is (VGG or a simple 2-layer CNN), sometimes the code just stops: the IPython notebook shows it’s still running, but the “top” command shows nothing is running.

There is no warning at all, and I don’t think it’s because of a memory leak (I only used about 10% of the memory).

(Although when memory does leak, the code also stops running with no warning, and the IPython notebook shows it’s still running.)

So what can I do? I mean, at least there should be some information saying why the code stopped.

Tell me if you need any information.

Do you get any error messages in the terminal where the jupyter notebook server was started?

Hi,

No. But I am using "screen", so I can’t scroll back.

What do you mean?
Do you start the notebook server on a remote machine and detach from the session using screen?
Can you reconnect to the session using screen -r?

Yeah, I reconnected to the session, but I can’t scroll back in the terminal, so I can’t see the earlier output.

As far as I could see, there was no error information.

Ok, then it’s kind of tricky.
Do you have a small script where the notebook crashes? Or does it crash randomly?
If it’s random, could you try to update jupyter etc.?

screen seems to support scrolling in some way: https://unix.stackexchange.com/a/40243

Disclaimer: I haven’t read or tried the solution in the linked answer.

I just run the default VGG16 from PyTorch, nothing special.

And it’s random, because it works well most of the time.

I will update that. Thanks.

Ok, let me know if it helps, since without a proper error message it could be the notebook server or another application on the server. Also, could you try @SimonW’s suggestion, just to exclude possible error sources?

Thanks! And thank you too, @SimonW.

I will try to provide more information when I run into this problem again.

The error occurred again (sorry, I haven’t updated the IPython notebook yet because it’s running).

There is no error information at all. Here is the “top” output:

PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20 0 17.009g 0.012t 124964 R 2000 19.0 8580:31 python
20 0 12.678g 9.038g 128752 S 0.0 14.4 76:30.30 python

It seems that the code is still running, but it uses 0% CPU. (The machine has enough CPU and memory to run it.)
(The first line is the same code but with different weights; the first runs well, and the second one seems stuck.)

Ok, in order to exclude potential code errors, could you export the Python code from that notebook and run it in a terminal?
If it hangs again, we should have a look at deadlocks etc.; if not, it’s likely some IPython/notebook issue.
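If it does hang, Python’s built-in faulthandler module can at least show where it is stuck (it’s in the standard library on Python 3, and available as a backport package on PyPI for Python 2). A minimal sketch:

```python
import faulthandler
import signal
import sys

# Dump the stack of every thread to stderr every 60 seconds; if the
# process deadlocks, the last dump shows where each thread is stuck.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# Also allow an on-demand dump from another shell via: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1)

# ... the rest of the training script runs as usual ...
```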

Hope the following is helpful:

I am using a small cluster (40 threads). This problem is not related to the IPython notebook, because I tried running the code without it. In general the code runs for several batches (fewer than 10) and then stops without any warning.

And this problem disappears after I restart the cluster.

I guess the problem may come from either running nn.DataParallel too many times or running pathos.multiprocessing too many times. BTW, I am using Python 2.

Lastly, this problem is not related to the model; at least, I tried different models.
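To isolate the DataParallel/multiprocessing suspicion, one thing I might try is a single-process baseline with no nn.DataParallel and no worker processes; if that never hangs, the problem is probably in the parallel parts. A rough sketch (MyModel and dataset are placeholders for whatever is actually used):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = MyModel()             # placeholder: the actual model goes here
loader = DataLoader(dataset,  # `dataset` is assumed to be defined elsewhere
                    batch_size=32,
                    num_workers=0)  # 0 = load data in the main process

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.train()
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```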

For how long does this problem disappear?
Do you see any patterns when it happens?

Still, you don’t get a proper error? The script just hangs and that’s it?

Could you create a conda environment with Python 3.6 and try it again?

> For how long does this problem disappear?

It disappears after I restart the cluster.

> Do you see any patterns when it happens?

No.

> Still, you don’t get a proper error? The script just hangs and that’s it?

No error. The script still holds the memory but uses 0% CPU.

> Could you create a conda environment with Python 3.6 and try it again?

Ok.

For me it happens when using Python 3.6 with 2 or 4 RTX 2080 GPUs.
The script can either work fine for many epochs or stop during the first one.

It has happened to me before too. I tested the code on an older version of Torch and got an error instead. My problem was related to passing a torch Tensor to a function written to take normal Python numeric values.

I have since cast the tensors to int before passing them to the function, and so far it hasn’t failed without warning.
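For reference, the failure mode looked roughly like this (make_range is a hypothetical stand-in for the function that expected a plain Python int):

```python
import torch

def make_range(n):
    # A plain-Python helper that expects `n` to be an int.
    return list(range(n))

t = torch.tensor(5)

# Passing the tensor directly relies on implicit conversion, which can
# error out or misbehave depending on the PyTorch version:
# make_range(t)

# Casting explicitly removes the ambiguity:
make_range(int(t.item()))
```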

When it happened, it happened whether the GPU or the CPU was used. It was happening on the latest stable version of PyTorch 1.0 with CUDA 10.0, on both Linux (Docker) and Windows.