RuntimeError: after reduction step 2: device-side assert triggered (PyTorch on Jupyter Notebook)

Flock1 · May 31, 2019, 11:01am

I am training a binary classifier using PyTorch in a Jupyter Notebook. This is the architecture:

import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1):
        super(AlexNet, self).__init__()
        self.conv_base = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2, bias=False),
            nn.BatchNorm2d(96),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),

            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),

            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),

            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),

            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.fc_base = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256*6*6, 4096),
            nn.ReLU(inplace=True),

            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),

            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.conv_base(x)
        x = x.view(x.size(0), 256*6*6)
        x = self.fc_base(x)
        return x

This is my loss function and optimizer:

criterion = nn.BCELoss()

# specify optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5, momentum=0.9)
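
One thing worth checking with this setup: nn.BCELoss expects inputs that are probabilities in [0, 1], while the network above ends in nn.Linear(4096, num_classes) with no sigmoid, so its outputs are unbounded. Values outside [0, 1] are a common trigger for a device-side assert in this loss on the GPU, though that has not been confirmed as the cause here. A minimal sketch, using made-up logits and targets in place of real model outputs, showing that nn.BCEWithLogitsLoss on raw outputs matches nn.BCELoss applied after torch.sigmoid:

```python
import torch
import torch.nn as nn

# Made-up values standing in for the raw outputs of the final nn.Linear layer.
logits = torch.tensor([[2.5], [-1.0], [0.3], [-4.2]])
# Targets must be floats with the same shape as the outputs, here (N, 1).
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# Raw logits are not probabilities: some fall outside [0, 1].
out_of_range = bool(((logits < 0) | (logits > 1)).any())

# Option 1: let the loss apply the sigmoid internally.
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Option 2: squash to [0, 1] first, then use BCELoss.
loss_after_sigmoid = nn.BCELoss()(torch.sigmoid(logits), targets)

print(out_of_range, loss_with_logits.item(), loss_after_sigmoid.item())
```

If this turns out to be the issue, either option keeps the loss input valid; nn.BCEWithLogitsLoss is generally the more numerically stable of the two.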

When I start the training, I get the following error:

RuntimeError: after reduction step 2: device-side assert triggered
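
Device-side asserts are reported asynchronously, so the traceback often does not point at the line that actually failed. One way to narrow it down is to inspect a batch on the CPU before it reaches the loss. The helper below, check_bce_inputs, is a hypothetical name introduced for illustration and is not part of PyTorch; it is a sketch that lists the conditions nn.BCELoss would object to, assuming outputs and targets are tensors taken from one training step:

```python
import torch

def check_bce_inputs(outputs, targets):
    """Hypothetical helper: list problems that would trip nn.BCELoss."""
    outputs = outputs.detach().cpu()
    targets = targets.detach().cpu()
    problems = []
    if outputs.shape != targets.shape:
        problems.append(
            f"shape mismatch: {tuple(outputs.shape)} vs {tuple(targets.shape)}"
        )
    if not targets.is_floating_point():
        problems.append(f"targets dtype is {targets.dtype}, expected a float dtype")
    if outputs.numel() > 0 and (outputs.min().item() < 0 or outputs.max().item() > 1):
        problems.append("outputs fall outside [0, 1]")
    if targets.numel() > 0 and (targets.min().item() < 0 or targets.max().item() > 1):
        problems.append("targets fall outside [0, 1]")
    return problems

# Made-up batch: raw logits of shape (2, 1) and integer labels of shape (2,).
bad = check_bce_inputs(torch.tensor([[2.5], [-1.0]]), torch.tensor([1, 0]))
# Made-up batch that satisfies all of the checks.
good = check_bce_inputs(torch.tensor([[0.9], [0.1]]), torch.tensor([[1.0], [0.0]]))
print(bad)
print(good)
```

An empty list means none of these particular conditions apply, not that the batch is guaranteed to be fine.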

Some weird things are also happening:

1) Training ran at one point, but the learning rate was too small. When I changed it and restarted training, I got the error.
2) I also get an error when I run this:

model = AlexNet()
model.cuda()

The error is:

RuntimeError: CUDA error: device-side assert triggered

Some workarounds I found that helped temporarily:

1) Adding os.environ['CUDA_LAUNCH_BLOCKING'] = '1' to the notebook.
2) The GPU memory was still in use, so I killed the process that was running the notebook.
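
Regarding the first workaround: CUDA_LAUNCH_BLOCKING only takes effect if it is set before CUDA is initialized, so it belongs at the very top of the notebook, after a kernel restart. It does not fix the assert; it makes kernel launches synchronous so that the traceback points at the failing call. Once a device-side assert has fired, the CUDA context of that process is generally unusable, which would explain why a later model.cuda() fails until the kernel is restarted. A minimal sketch of such a setup cell, where the CPU fallback is an assumption added for illustration:

```python
import os

# Must run before the first CUDA call, ideally before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Assumption for illustration: fall back to the CPU when no GPU is visible.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"running on {device}")
```

Running one batch with the device set to the CPU is another way to get a readable error message, since CPU-side checks raise ordinary Python exceptions.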

However, nothing seems to be working permanently. Any suggestions?