RuntimeError: illegal memory access / Creating MTGP constants failed, multiple GPUs only

I'm getting segfaults when using multiple GPUs together with a tensor that is used to sample random numbers. The same, unchanged code runs fine on a single GPU, but on multiple GPUs it intermittently fails with the errors below, and I've noticed that the likelihood of hitting the error grows with the number of GPUs used.

Here’s one of the errors and the code for reproduction:

$ python minimal.py
THCudaCheck FAIL file=/tmp/pip-qmk9li80-build/torch/lib/THC/generic/THCStorage.cu line=66 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "minimal.py", line 48, in <module>
    test()
  File "minimal.py", line 44, in test
    net(x_gpu)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 60, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 70, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 67, in parallel_apply
    raise output
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 42, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "minimal.py", line 23, in forward
    output_enc = self.conv(output_enc)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 154, in forward
    self.padding, self.dilation, self.groups)
  File "/opt/conda/envs/pytorch-py35/lib/python3.5/site-packages/torch/nn/functional.py", line 85, in conv1d
    return f(input, weight, bias)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /tmp/pip-qmk9li80-build/torch/lib/THC/generic/THCStorage.cu:66
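
(Side note: CUDA errors are reported asynchronously, so the conv1d call in the traceback is not necessarily where the illegal access actually originates. Rerunning with

$ CUDA_LAUNCH_BLOCKING=1 python minimal.py

forces synchronous kernel launches and points at the faulting operation more reliably.)

And the reproduction code: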

import torch
from torch import nn
from torch.autograd import Variable


class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()

        self.nz = 1024
        self.nzf = 8
        # Noise tensor allocated on the current device (GPU 0) at construction
        # time; it is a plain attribute, not a parameter or registered buffer.
        self.znoise = torch.cuda.FloatTensor(1, self.nz, self.nzf)
        self.znoise = Variable(self.znoise)
        self.conv = nn.Conv1d(1, 16, kernel_size=31, stride=2, padding=15, dilation=1)

    def forward(self, x):
        # Resize the noise to match the batch and refill it in place with
        # samples from N(0, 1) on every forward pass.
        if self.znoise.size(0) != x.size(0):
            self.znoise.data.resize_(x.size(0), self.nz, self.nzf)
        self.znoise.data.normal_(0., 1.)
        znoise = self.znoise

        output_enc = x.unsqueeze(1)
        output_enc = self.conv(output_enc)
        return output_enc


def to_gpu(x, inference=False):
    # async=True requests a non-blocking host-to-device copy (it only takes
    # effect for pinned memory); volatile disables autograd tracking.
    x = x.float()
    if torch.cuda.is_available():
        x = x.cuda(async=True)
    return Variable(x, volatile=inference)


def test(iterations=10, seed=1234, batch_size=600, out_length=16384):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

    # DataParallel replicates the module and splits the batch across all
    # visible GPUs; the crash only shows up with more than one device.
    net = torch.nn.DataParallel(Network().cuda())
    x = torch.FloatTensor(batch_size, out_length)

    for i in range(iterations):
        x.normal_(0., 1)
        x_gpu = to_gpu(x)
        net(x_gpu)
        print(i, end=' ')


test()
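
My current suspicion is that znoise is the culprit: it is created on GPU 0 in __init__ and, since it is neither a parameter nor a registered buffer, DataParallel does not replicate it, so all worker threads resize and fill the same GPU-0 tensor concurrently. Below is a minimal sketch of the alternative I'm considering (untested; it reuses the imports from minimal.py and keeps the noise unused, as in the original, since only the RNG call matters for the repro):

class NetworkPerDeviceNoise(nn.Module):
    """Variant of Network that allocates the noise inside forward().

    Each DataParallel replica then creates its own noise tensor on the
    device of its input chunk instead of sharing one GPU-0 tensor.
    """
    def __init__(self):
        super(NetworkPerDeviceNoise, self).__init__()
        self.nz = 1024
        self.nzf = 8
        self.conv = nn.Conv1d(1, 16, kernel_size=31, stride=2, padding=15, dilation=1)

    def forward(self, x):
        # x.data.new(...) allocates a tensor of the same type on the same
        # device as x, so each replica samples its noise locally.
        znoise = Variable(x.data.new(x.size(0), self.nz, self.nzf).normal_(0., 1.))

        output_enc = x.unsqueeze(1)
        output_enc = self.conv(output_enc)
        return output_enc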

@SimonW, any thoughts on this, given that you've been working on GANs?
Or @apaszke, given that you've written code in THCTensorRandom.cu?

What happens if you don't use DataParallel, but run everything on the second GPU? i.e.

x = x.cuda(1)
net = net.cuda(1)
y = net(x)
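
For a fuller version of that check, something along these lines should do (a sketch reusing Network and the imports from minimal.py; test_single_gpu is just an illustrative name):

def test_single_gpu(device=1, iterations=10, batch_size=600, out_length=16384):
    # Constructing the model inside the device context matters here, because
    # Network.__init__ allocates znoise on the *current* device.
    with torch.cuda.device(device):
        net = Network().cuda(device)
        x = torch.FloatTensor(batch_size, out_length)
        for i in range(iterations):
            x.normal_(0., 1.)
            net(Variable(x.cuda(device)))
            print(i, end=' ')

If this runs cleanly, that would point at a cross-device interaction under DataParallel rather than anything wrong with a non-default GPU on its own.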