Hi, I have some code that was working with PyTorch a couple of releases ago.
But with the latest pip version (stable, Linux, CUDA 10.0, Python 3.7) I get an error:
"RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"
But I'm not using multiprocessing, or DataParallel either.
The extent of my ‘parallelization’ is the following…
import numpy as np

def worker_init(worker_id):
    """
    Used with the PyTorch DataLoader so that each worker can grab random bits
    of files or synthesize random input data on the fly.
    Without this you get the same data every epoch.
    """
    # NOTE: this implementation prevents strict reproducibility, since each
    # worker reseeds itself from OS entropy.
    np.random.seed()
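(As an aside: if I ever do need strict reproducibility, my understanding is that each worker can instead be seeded from PyTorch's per-worker seed. A sketch, where worker_init_reproducible is a hypothetical replacement for the function above:)

import numpy as np
import torch

def worker_init_reproducible(worker_id):
    # torch.initial_seed() already differs per worker (base_seed + worker_id),
    # so deriving NumPy's seed from it keeps workers distinct but repeatable.
    np.random.seed(torch.initial_seed() % 2**32)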
Then in my training code…
dataloader = DataLoader(my_dataset, ..., num_workers=10, worker_init_fn=worker_init)
and the error occurs at the following line in my code:
for x, y in dataloader:
The error is:
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
…more lines of error message, ending with…
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
I was not aware I was re-initializing CUDA.
As a possible fix anyway, I tried adding the lines of code that are recommended when one is using multiprocessing:
from torch.multiprocessing import Pool, Process, set_start_method, cpu_count

try:
    set_start_method('spawn')
except RuntimeError:
    pass
…but these have no effect.
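(For completeness: my understanding is that the start method can also be requested per-DataLoader rather than globally, assuming the installed PyTorch is new enough to have the multiprocessing_context argument. A sketch of that variant, which I haven't confirmed helps here:)

dataloader = DataLoader(my_dataset, ...,
                        num_workers=10,
                        worker_init_fn=worker_init,
                        multiprocessing_context='spawn')  # spawn workers instead of forking them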
As per this thread, my dataloader loads a dataset my_dataset
which is purely numpy arrays on the CPU, and only moves data to the GPU one line after where the error occurs…
for x, y in dataloader:                        # << error occurs here
    x_cuda, y_cuda = x.to(device), y.to(device)
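(In case it matters, here's a stripped-down stand-in for my_dataset. The names and shapes here are made up, but the essential point holds: __getitem__ only ever touches CPU-side numpy arrays, never CUDA:)

import numpy as np
from torch.utils.data import Dataset

class MyDataset(Dataset):
    """Simplified stand-in: the real one grabs random bits of files or
    synthesizes random input on the fly, but it is all numpy on the CPU."""

    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # No torch.cuda calls, no CUDA tensors anywhere in here.
        x = np.random.randn(1, 8192).astype(np.float32)
        y = np.random.randn(1, 8192).astype(np.float32)
        return x, y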
But unlike that aforementioned thread, I definitely want to keep multiple workers in my DataLoader! Even if I set num_workers=1, I still get the same error. (Presumably num_workers=0 would sidestep it, since that loads everything in the main process without forking, but that defeats the purpose.)
So… the other answers I've seen seem to assume you're explicitly using multiprocessing, but I'm not (apart from whatever the DataLoader itself does internally to run its workers).
Can anyone suggest how to fix this?
Thanks.
EDIT: Full trace follows…
Traceback (most recent call last):
  File "main_script.py", line 99, in <module>
    apex_opt=args.apex, target_type=args.target, lr_max=args.lrmax, in_checkpointname=args.checkpoint)
  File "mycode.py", line 267, in train
    y_size, parallel, logfilename, out_checkpointname, sr=sr, lr_max=lr_max)
  File "mycode.py", line 104, in train_loop
    for x, y in dataloader:
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __next__
    return self._process_data(data)
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 80, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 80, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 65, in default_collate
    return default_collate([torch.as_tensor(b) for b in batch])
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 65, in <listcomp>
    return default_collate([torch.as_tensor(b) for b in batch])
  File "/home/myusername//anaconda3/envs/myenv/lib/python3.7/site-packages/torch/cuda/__init__.py", line 177, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

What strikes me about this trace is that the worker dies inside default_collate, at the torch.as_tensor(b) call: merely turning my numpy batch into tensors is apparently trying to initialize CUDA inside the forked worker.
P.S.- Setting pin_memory to either True or False yields the same error.
P.P.S.- One thing on my system that did change: I downgraded from CUDA 10.1 to 10.0. Is it possible that this error message is really indicating some sort of CUDA version mismatch resulting from (perhaps) incompletely removing CUDA 10.1?
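(If it's relevant, here's a quick sanity check I can paste the output of, showing which CUDA version my PyTorch build expects:)

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version this build was compiled against
print(torch.cuda.is_available())  # whether CUDA can initialize at all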