CUDA Error when doing multiprocessing on CPU

I am trying to run the A3C algorithm using the code provided here: https://github.com/ikostrikov/pytorch-a3c (PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning").

After fixing some trivial bugs in the code related to expand_as, the following error is thrown:

[2017-07-30 21:27:20,583] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,900] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,905] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,915] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,922] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,933] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,938] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,950] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,950] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,982] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,973] Making new env: PongDeterministic-v4
[2017-07-30 21:27:20,999] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,022] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,029] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,073] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,077] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,107] Making new env: PongDeterministic-v4
[2017-07-30 21:27:21,122] Making new env: PongDeterministic-v4
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called recursively
terminate called after throwing an instance of ‘std::runtime_error’
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called recursively
terminate called after throwing an instance of ‘std::runtime_error’
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error
terminate called after throwing an instance of ‘std::runtime_error’
what(): CUDA error (3): initialization error

The code I am referencing does not make any calls to CUDA, so why would I be getting this error?

Thanks.

We've come across the same problem. Any solution to this?

I've run into the same issue. Another thing is that PyTorch seems to be allocating space on the GPU, despite no tensors being moved there.

I still have not found a solution to this.

The same error here. Have you found a solution? The strange thing is that when I run the same Python script on another computer with the same configuration (CUDA 8.0 + cuDNN 6 + Python 2.7, but a different OS: Ubuntu 14.04 vs Ubuntu 16.04), it works fine.

It seems that this error occurs when computing the backward pass of the final loss in each training process.

Emm… I have solved the problem. Are you using the latest version (0.2.0)? I went back to version 0.1.12 and the problem went away. I think it may be a bug.

Which package's version do you mean?
I'm facing the same issue with CUDA 8.0.61, Python 2.7, Ubuntu 14.04.

CUDA 8.0, cuDNN 6.0, Python 2.7 (same problem with 3.5), Ubuntu 16.04. I rolled back to PyTorch version 0.1.12 and the problem is fixed.


Downgrading works for me too. Thanks!

Glad to help you!

Same error here.
CUDA 8.0, PyTorch v0.2.0, Python 2.7

Can any of you show me the output of nvidia-smi?

Here is my nvidia-smi output:

Wed Aug 23 19:34:42 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:08:00.0     Off |                  N/A |
| 36%   60C    P2    96W / 250W |   6795MiB / 12189MiB |     61%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:09:00.0     Off |                  N/A |
| 30%   52C    P2    56W / 250W |  10363MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 0000:88:00.0     Off |                  N/A |
| 35%   59C    P2   105W / 250W |   6845MiB / 12189MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 0000:89:00.0     Off |                  N/A |
| 40%   66C    P2   127W / 250W |   5027MiB / 12189MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     73255    C   ./build/tools/caffe                           6793MiB |
|    1      8248    C   ./build/tools/caffe                           4381MiB |
|    1      8464    C   ./build/tools/caffe                           2469MiB |
|    1     67991    C   python                                        3511MiB |
|    2     23050    C   ./build/tools/caffe                           6843MiB |
|    3     23050    C   ./build/tools/caffe                           5025MiB |
+-----------------------------------------------------------------------------+

But the confusing part is that the Python script implements A3C on the CPU; as far as I can tell, it has nothing to do with the GPU.

Some more info here: https://github.com/pytorch/pytorch/issues/2517#issuecomment-325039259

I think this can be solved by calling mp.set_start_method('spawn') before any CUDA call (including seeding the RNG, for example) when dealing with CUDA share_memory objects.
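
For reference, a minimal sketch of that ordering (assuming Python 3; the worker function and model below are placeholders, not the actual A3C code from the repo):

import torch
import torch.multiprocessing as mp

def train(rank, shared_model):
    # Placeholder worker: in the real code each process would build its own
    # env/optimizer and update the shared (CPU) parameters.
    for param in shared_model.parameters():
        param.data.add_(0.0)  # stand-in for a real training step

if __name__ == '__main__':
    # Must be called before anything that initializes CUDA (seeding the GPU RNG,
    # moving a tensor to the GPU, ...); otherwise forked workers inherit a
    # half-initialized CUDA context and can die with "initialization error".
    mp.set_start_method('spawn')

    shared_model = torch.nn.Linear(4, 2)  # placeholder model
    shared_model.share_memory()

    processes = []
    for rank in range(4):
        proc = mp.Process(target=train, args=(rank, shared_model))
        proc.start()
        processes.append(proc)
    for proc in processes:
        proc.join()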

Same error! But:

AttributeError: 'module' object has no attribute 'set_start_method'

It seems that torch.multiprocessing has no set_start_method().

I'm hitting the same error.
Could you please tell me how to go back to version 0.1.12? The method is not described on http://pytorch.org/.
Thank you.

If you are not familiar with Git, you can just go to https://github.com/pytorch/pytorch/releases, download the v0.1.12 source code, and build PyTorch from scratch following the instructions in the README of PyTorch's repo.

I am sure torch.multiprocessing has this method implemented. Maybe you are on an OS other than Linux and the method is not implemented in python’s multiprocessing in the first place?

Probably because Python 2.7 has no set_start_method.
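
For what it's worth, a minimal guard for that case (just a sketch; torch.multiprocessing re-exports the standard multiprocessing API, so the availability check follows Python's own version):

import sys
import torch.multiprocessing as mp

# multiprocessing.set_start_method only exists on Python >= 3.4, and
# torch.multiprocessing wraps the standard module, so guard the call.
if sys.version_info >= (3, 4):
    mp.set_start_method('spawn')
else:
    # Python 2.7 has no start-method selection; processes are always forked,
    # which is why the 'spawn' workaround isn't available there.
    pass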
