RuntimeError: cuda runtime error (3)

I want to put a ResNet inside a transforms.Compose pipeline and then use a DataLoader to create batches for training. When the ResNet and the images are both on the CPU, it works. But when both are on the GPU, I get a CUDA error:

File “”, line 4, in
transforms.Lambda(lambda x: FM(resnet, x))])
File “”, line 23, in FM
input_imgs = Variable(imgs.cuda(),volatile=True)
File “/project/focus/hong/anaconda3/lib/python3.5/site-packages/torch/”, line 65, in cuda
return new_type(self.size()).copy(self, async)
RuntimeError: cuda runtime error (3) initialization error at /py/conda-bld/pytorch_1490983232023/work/torch/lib/THC/generic/THCStorage.c:55

What’s wrong with .cuda()?

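import os
import torch
from torch.autograd import Variable
from torchvision import datasets, transforms

data_transforms = transforms.Compose([transforms.ToTensor(),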
                                      transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
                                      transforms.Lambda(lambda x: FM(resnet, x))])

def FM(model, img):
    [d,h,w] = img.size()
    if w > h:
        w_new, h_new, s_new = int((w - h)/2), 0, h // NGSIZE
    else:
        w_new, h_new, s_new = 0, int((h - w)/2), w // NGSIZE

    imgs = torch.Tensor(NGSIZE**2,d,s_new,s_new)

    # crop the center part of the image
    i = 0
    for nh in range(NGSIZE):
        for nw in range(NGSIZE):
            Hsta = h_new + nh * s_new
            Wsta = w_new + nw * s_new
            Hend = h_new + nh * s_new + s_new
            Wend = w_new + nw * s_new + s_new
            imgs[i,:,:,:] = img[:,Hsta:Hend,Wsta:Wend]
            i += 1
    if use_gpu:
        input_imgs = Variable(imgs.cuda())
    else:
        input_imgs = Variable(imgs)

    # forward
    outputs = model.forward(input_imgs).squeeze()
    return outputs.cpu().data.view(3,3,512)

# %% data loader
dsets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms) for x in ['train', 'val']}
dset_loaders = {x: torch.utils.data.DataLoader(dsets[x], batch_size=1,
                                               shuffle=False, num_workers=16)
                for x in ['train', 'val']}

As mentioned here:
If you want to use CUDA, you have to use the forkserver or spawn start methods.

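A DataLoader with num_workers greater than 0 uses multiprocessing, so your transform (including the .cuda() call in FM) runs inside worker processes. With the default fork start method, CUDA cannot be initialized in those workers, which is where the initialization error comes from.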
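
To make the start-method point concrete, here is a minimal sketch using only the standard library's multiprocessing module, which torch.multiprocessing wraps. The helper names make_context and run_demo are hypothetical and only for illustration. The idea is to request a "spawn" context whenever the workers will touch CUDA. Recent PyTorch versions also accept a multiprocessing_context argument on DataLoader for the same purpose, but that may not exist in the version from your traceback, so treat it as an assumption to check. The simplest workaround is num_workers=0, which keeps the transform in the main process.

```python
import multiprocessing as mp
import os


def report_pid(_):
    # Runs inside a worker process and returns that worker's process id.
    return os.getpid()


def make_context(use_gpu):
    # Hypothetical helper: ask for "spawn" when workers will use CUDA,
    # otherwise keep the platform default (historically fork on Linux).
    return mp.get_context("spawn") if use_gpu else mp.get_context()


def run_demo(use_gpu=True, workers=2):
    # Start a small pool with the chosen context. With "spawn", each worker
    # is a fresh interpreter, so it does not inherit the parent's CUDA state.
    ctx = make_context(use_gpu)
    with ctx.Pool(processes=workers) as pool:
        return pool.map(report_pid, range(workers))
```

Call run_demo() from under an if __name__ == "__main__": guard in a script, since spawned workers re-import the main module. Another option is to keep only the cropping in the transform and run the ResNet forward pass on the GPU in the main training loop, so the workers never call .cuda() at all.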