Kernel does not die but stops returning results after a certain number of epochs

Hi there,
I am facing an issue with my code. Here are the details:

  1. Output from Anaconda Prompt:
    Traceback (most recent call last):
      File "C:\Users\User\anaconda3\envs\pytorch-gpu\lib\multiprocessing\queues.py", line 244, in _feed
        obj = _ForkingPickler.dumps(obj)
      File "C:\Users\User\anaconda3\envs\pytorch-gpu\lib\multiprocessing\reduction.py", line 51, in dumps
        cls(buf, protocol).dump(obj)
      File "C:\Users\User\anaconda3\envs\pytorch-gpu\lib\site-packages\torch\multiprocessing\reductions.py", line 261, in reduce_tensor
        event_sync_required) = storage._share_cuda_()
      File "C:\Users\User\anaconda3\envs\pytorch-gpu\lib\site-packages\torch\storage.py", line 920, in _share_cuda_
        return self._untyped_storage._share_cuda_(*args, **kwargs)
    RuntimeError: CUDA error: mapping of buffer object failed
    Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

  2. Output from Jupyter Notebook: it runs until a certain epoch, then stops returning anything until the error above shows up.

  3. My model:
    import torch.nn as nn
    import torch.nn.functional as F


    class CNNNetwork(nn.Module):
        def __init__(self):
            super(CNNNetwork, self).__init__()
            # Four conv blocks, each Conv2d -> ReLU -> MaxPool2d
            self.conv1 = nn.Sequential(
                nn.Conv2d(in_channels=1, out_channels=16,
                          kernel_size=3, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2))
            self.conv2 = nn.Sequential(
                nn.Conv2d(in_channels=16, out_channels=32,
                          kernel_size=3, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2))
            self.conv3 = nn.Sequential(
                nn.Conv2d(in_channels=32, out_channels=64,
                          kernel_size=3, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2))
            self.conv4 = nn.Sequential(
                nn.Conv2d(in_channels=64, out_channels=128,
                          kernel_size=3, stride=1, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2))
            self.flatten = nn.Flatten()
            # Fully connected classifier head with 2 output classes
            self.fc1 = nn.Linear(128 * 2 * 20, 120)
            self.fc2 = nn.Linear(120, 60)
            self.fc3 = nn.Linear(60, 2)

        def forward(self, x):
            x = self.conv1(x)
            x = self.conv2(x)
            x = self.conv3(x)
            x = self.conv4(x)
            x = self.flatten(x)
            x = x.view(-1, 128 * 2 * 20)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x

I am still new to this and hope someone can help me.
Thanks!

Try removing the use of multiple processes, e.g. by setting num_workers=0 in your DataLoader, and also make sure the if-clause guard (if __name__ == "__main__":) is used as explained here.
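
A minimal sketch of both suggestions, in case it helps. The traceback shows a CUDA tensor being pickled for a multiprocessing queue, so this sketch also keeps the dataset on the CPU and moves each batch to the device inside the loop. The names `make_loader` and `main` and the random stand-in data are hypothetical and only for illustration; your own dataset and training loop would go in their place.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(num_workers=0):
    # Hypothetical stand-in data: 8 single-channel 16x16 inputs, binary labels.
    # The tensors stay on the CPU inside the dataset.
    inputs = torch.randn(8, 1, 16, 16)
    labels = torch.randint(0, 2, (8,))
    dataset = TensorDataset(inputs, labels)
    # num_workers=0 loads batches in the main process, so no tensors are
    # pickled and sent through multiprocessing queues.
    return DataLoader(dataset, batch_size=4, shuffle=True, num_workers=num_workers)


def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    loader = make_loader(num_workers=0)
    for batch_inputs, batch_labels in loader:
        # Move each batch to the GPU (if available) in the main process.
        batch_inputs = batch_inputs.to(device)
        batch_labels = batch_labels.to(device)
        print(batch_inputs.shape, batch_labels.shape)


# On Windows, worker processes are started with "spawn" and re-import this
# module; the guard keeps the code in main() from running again on import.
if __name__ == "__main__":
    main()
```

If you later want num_workers > 0 again, keeping the guard in place and returning only CPU tensors from the dataset is probably the safer starting point.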


Thank you so much for your response!
It took me a day to solve this issue.
Thank you!!!