CUDA Error when doing multiprocessing on CPU

I am trying to run the A3C algorithm using the code provided here: https://github.com/ikostrikov/pytorch-a3c (PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning").

After fixing some trivial bugs in the code related to expand_as, the following error is thrown:

[2017-07-30 21:27:20,583] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,900] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,905] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,915] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,922] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,933] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,938] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,950] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,950] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,982] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,973] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,999] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,022] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,029] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,073] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,077] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,107] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,122] Making new env: PongDeterministic-v4
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called recursively
terminate called after throwing an instance of ‘std::runtime_error’
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called recursively
terminate called after throwing an instance of ‘std::runtime_error’
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error

The code I am referencing does not make any calls to CUDA, so why would I be getting this error?

Thanks.

We've come across the same problem. Any solution to this?

I've run into the same issue. Another thing is that PyTorch seems to be allocating space on the GPU, despite no tensors being moved there.

I still have not found a solution to this.

The same error here. Have you found a solution? The strange thing is that when I run the same Python script on another computer with the same configuration (CUDA 8.0 + cuDNN 6 + Python 2.7, but a different OS: Ubuntu 14.04 vs Ubuntu 16.04), it works fine.

It seems that this error occurs when computing the backward pass of the final loss in each training process.

Emm… I have solved the problem. Are you using the latest version (0.2.0)? I went back to version 0.1.12 and the problem went away. I think it may be a bug.

Which package's version do you mean?
I'm facing the same issue with CUDA 8.0.61, Python 2.7, Ubuntu 14.04.

CUDA 8.0, cuDNN 6.0, Python 2.7 (same problem with 3.5), Ubuntu 16.04. I rolled back to PyTorch version 0.1.12 and the problem is fixed.


Downgrading works for me too. Thanks!

Glad to help you!

Same error here.
CUDA 8.0, PyTorch v0.2.0, Python 2.7

Can any of you show me the output of nvidia-smi?

Here is my nvidia-smi output:

Wed Aug 23 19:34:42 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:08:00.0     Off |                  N/A |
| 36%   60C    P2    96W / 250W |   6795MiB / 12189MiB |     61%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:09:00.0     Off |                  N/A |
| 30%   52C    P2    56W / 250W |  10363MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 0000:88:00.0     Off |                  N/A |
| 35%   59C    P2   105W / 250W |   6845MiB / 12189MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 0000:89:00.0     Off |                  N/A |
| 40%   66C    P2   127W / 250W |   5027MiB / 12189MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     73255    C   ./build/tools/caffe                           6793MiB |
|    1      8248    C   ./build/tools/caffe                           4381MiB |
|    1      8464    C   ./build/tools/caffe                           2469MiB |
|    1     67991    C   python                                        3511MiB |
|    2     23050    C   ./build/tools/caffe                           6843MiB |
|    3     23050    C   ./build/tools/caffe                           5025MiB |
+-----------------------------------------------------------------------------+

But the confusing part is that the Python script implements A3C on the CPU; as far as I can tell, it has nothing to do with the GPU.

Some more info here: https://github.com/pytorch/pytorch/issues/2517#issuecomment-325039259

I think this can be solved by calling mp.set_start_method('spawn') before any CUDA call (including seeding the RNG, for example) when dealing with CUDA share_memory objects.
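
For reference, a minimal sketch of that ordering (assuming Python 3; the worker function and model below are placeholders, not the actual A3C code from the repo):

import torch
import torch.multiprocessing as mp

def train(rank, shared_model):
    # Placeholder worker: in the real code each process would build its own
    # env/optimizer and update the shared (CPU) parameters.
    for param in shared_model.parameters():
        param.data.add_(0.0)  # stand-in for a real training step

if __name__ == '__main__':
    # Must be called before anything that initializes CUDA (seeding the GPU RNG,
    # moving a tensor to the GPU, ...); otherwise forked workers inherit a
    # half-initialized CUDA context and can die with "initialization error".
    mp.set_start_method('spawn')

    shared_model = torch.nn.Linear(4, 2)  # placeholder model
    shared_model.share_memory()

    processes = []
    for rank in range(4):
        proc = mp.Process(target=train, args=(rank, shared_model))
        proc.start()
        processes.append(proc)
    for proc in processes:
        proc.join()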

Same error! But:

AttributeError: 'module' object has no attribute 'set_start_method'

It seems that torch.multiprocessing has no set_start_method().

I'm hitting the same error.
Could you please tell me how to go back to version 0.1.12? The method is not described on http://pytorch.org/.
Thank you.

If you are not familiar with Git, you can just go to https://github.com/pytorch/pytorch/releases, download the v0.1.12 source code, and build PyTorch from scratch following the instructions in the README of PyTorch's repo.

I am sure torch.multiprocessing has this method implemented. Maybe you are on an OS other than Linux and the method is not implemented in python’s multiprocessing in the first place?

Probably because Python 2.7 has no set_start_method.
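
For what it's worth, a minimal guard for that case (just a sketch; torch.multiprocessing re-exports the standard multiprocessing API, so the availability check follows Python's own version):

import sys
import torch.multiprocessing as mp

# multiprocessing.set_start_method only exists on Python >= 3.4, and
# torch.multiprocessing wraps the standard module, so guard the call.
if sys.version_info >= (3, 4):
    mp.set_start_method('spawn')
else:
    # Python 2.7 has no start-method selection; processes are always forked,
    # which is why the 'spawn' workaround isn't available there.
    pass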
