Hi,
When I use pytorch with ipython notebook remotely in the cluster, and I used cpu. I find that, whatever the model is, vgg or a simple 2-layer cnn one, sometimes the code just stopped and the ipython notebook shows it’s still running but the “top” commands shows nothing is running.
No warning at all and I don’t think it’s because memory leak ( because I only uesed about 10% memory ).
( although when memory leaks, the code will stop running and no warning at all and the ipython notebook shows it’s still running.)
So what can I do? I mean, at least there should be some information to say why the code stops.
What do you mean?
Do you start the notebook server on a remote machine and detach from the session using screen?
Can you reconnect to the session using screen -r?
Ok, then it’s kind of tricky.
Do you have a small script, where the notebook crashes? Or does is crash randomly?
If it’s randomly, could you try to update jupyter etc.?
Ok, let me know, if it helps, since without a proper error message, it could be the notebook server or another application on the server. Also, could you tyr @SimonW’s suggestion, just to exclude possible error sources?
The error occurs again ( sorry that I haven’t update ipython notebook because it’s running) .
No error information at all and I use top,
PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20 0 17.009g 0.012t 124964 R 2000 19.0 8580:31 python
20 0 12.678g 9.038g 128752 S 0.0 14.4 76:30.30 python
it seems that the code is still running but cost 0 cpu. ( the machine has enough cpu and memory to run )
( The first line is the same code, but with different weight. The first runs well and the second seems strange).
Ok, in order to exclude potential code errors, could you export the python code from that notebook and run it in a terminal?
If it hangs again, we should have a look at deadlocks etc., if not it’s likely some IPython/notebook issue.
I am using a small cluster ( 40 thread ). This problem has no related with IPython notebook because I tried running the code without IPython notebook. And in general the code runs for several batches ( less than 10 ) and stops without any warning.
And this problem disappear after I restart the cluster.
I guess the problem may come from, either I run too many times of the nn.DataParallel, or I run too many times of pathos.multiprocessing. BTW, I am using Python 2.
Last, this problem has no related with model, at least I tried with different models.
It has happened to me earlier too. I tested the code on an older version of Torch and I got an error instead. Mine problem was related to passing torch Tensor to a function written to take normal Python numeric values.
I have since cast the tensors to int before passing to the function and so far it hasn’t been failing without warning.
When it happened, it happened whether GPU or CPU was used. It was happening on the the latest stable version of pyTorch 1.0 CUDA 10.0 on both Linux Docker and Windows.