Yet another RuntimeError: CUDA error: an illegal memory access was encountered

Error we get:
We get an illegal memory access error during the forward pass of a linear layer (the relativeposeregressor module). When the code is run with CUDA_LAUNCH_BLOCKING=1, the following error is raised:

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
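
For reference, CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the failure is reported at the offending call instead of at a later, unrelated one. It can be set in the shell (CUDA_LAUNCH_BLOCKING=1 python repro.py, where repro.py is just a placeholder name for the script below) or programmatically, as in this minimal sketch:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the CUDA context is created

import torch  # imported afterwards so the variable takes effect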

Environment Details:

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA TITAN RTX
Nvidia driver version: 465.19.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.1.243             h6bb024c_0  
[conda] libblas                   3.9.0                     9_mkl    conda-forge
[conda] libcblas                  3.9.0                     9_mkl    conda-forge
[conda] liblapack                 3.9.0                     9_mkl    conda-forge
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.20.3           py38hf144106_0  
[conda] numpy-base                1.20.3           py38h74d4b33_0  
[conda] pytorch                   1.7.1           py3.8_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchaudio                0.7.2                      py38    pytorch
[conda] torchvision               0.8.2                py38_cu101    pytorch

Bug Explained: Given the environment above and a single GPU, a PyTorch linear layer produces an extremely large (likely erroneous) value that corrupts the matrix multiplication, but only when the batch size, i.e. the 0th dimension of the input tensor to the linear layer, is less than 9. The bug does not appear on CPU, and it does not appear on GPU when the input tensor's batch size is 9 or greater.
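
To narrow the problem down, here is a minimal sketch (my own, not from the original report) that isolates just the suspect layer, using the same shape as the first nn.Linear of the regressor below; the probe batch sizes are arbitrary choices around the reported threshold. Note that once an illegal memory access is raised the CUDA context is corrupted, so in practice each batch size should be probed in a fresh process:

import torch
import torch.nn as nn

device = torch.device("cuda")
linear = nn.Linear(512 * 30 * 40, 4096).to(device)  # same shape as the regressor's first layer

for batch in (1, 8, 9, 16):  # hypothetical probe values around the reported threshold
    x = torch.ones(batch, 512 * 30 * 40, device=device)
    try:
        out = linear(x)
        torch.cuda.synchronize()  # surface asynchronous CUDA errors at this point
        print(batch, "ok, max abs output:", out.abs().max().item())
    except RuntimeError as e:
        print(batch, "failed:", e)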

Reproduce Bug:
Install PyTorch via conda with the following specs (the bug reproduces on this 1.8.0 / CUDA 10.2 build as well as on the 1.7.1 / CUDA 10.1 environment listed above):

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch

Then run the following code snippet, varying THE_TRICKY_PARAM above and below 9 to toggle the error.

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

THE_TRICKY_PARAM = 1  # values below 9 trigger the illegal memory access error

def train(epoch):
    model.train()

    # Dummy batch: THE_TRICKY_PARAM is the batch size (0th dimension).
    input = torch.ones([THE_TRICKY_PARAM, 3, 480, 640], dtype=torch.float)
    input = input.to(device)
    image_encoding = model.encoder(input)

    print(image_encoding.shape)
    # The crash occurs inside this forward pass (the first nn.Linear).
    relativeposeregressed = model.relativeposeregressor(image_encoding)
    print(relativeposeregressed)
    print(relativeposeregressed.shape)

    # Note: this split only makes sense for THE_TRICKY_PARAM >= 3; with
    # smaller values the CUDA error above fires before this line is reached.
    rposQ, rposP, rposN = torch.split(relativeposeregressed, [1, 1, THE_TRICKY_PARAM - 2])

    print(rposQ)
    print(rposP)
    print(rposN)

    optimizer.zero_grad()

class Flatten(nn.Module):
    # Collapse all dimensions except the batch dimension.
    def forward(self, input):
        return input.view(input.size(0), -1)
    
encoder_dim = 512
encoder = models.vgg16(pretrained=True)

# Drop the last ReLU and max-pool of the VGG16 feature extractor.
layers = list(encoder.features.children())[:-2]

# Freeze everything except the last five layers.
for l in layers[:-5]:
    for p in l.parameters():
        p.requires_grad = False

encoder = nn.Sequential(*layers)
model = nn.Module()
model.add_module('encoder', encoder)

relativeposeregressor = nn.Sequential(
    Flatten(),
    nn.Linear(encoder_dim * 30 * 40, 4096),  # a 480x640 input gives a 30x40 feature map
    nn.ReLU(),
    nn.Linear(4096, 2),
    nn.ReLU(),
)
model.add_module('relativeposeregressor', relativeposeregressor)

device = torch.device("cuda")

model = model.to(device)

optimizer = optim.SGD(filter(lambda p: p.requires_grad, 
    model.parameters()), lr=0.0001,
    momentum=0.9,
    weight_decay=0.001)

scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

epoch = 1
scheduler.step(epoch)  # note: passing an epoch to step() is deprecated in recent PyTorch
train(epoch)
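
Until an upgrade is possible, one stopgap (my own suggestion, not part of the original report; safe_regress and BATCH_THRESHOLD are hypothetical names) is to route small batches through a CPU copy of the regressor, since the bug only reproduces on GPU. The sketch is only suitable for forward passes; the CPU copy's parameters are not registered with the optimizer, so training through it would need extra bookkeeping:

import copy

BATCH_THRESHOLD = 9  # reported failing boundary

# CPU copy of the regressor; re-copy after each optimizer step to keep weights in sync.
regressor_cpu = copy.deepcopy(model.relativeposeregressor).cpu()

def safe_regress(image_encoding):
    if image_encoding.size(0) < BATCH_THRESHOLD:
        # Route small batches through the CPU, where the bug does not occur,
        # then move the result back to the original device.
        return regressor_cpu(image_encoding.cpu()).to(image_encoding.device)
    return model.relativeposeregressor(image_encoding)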

Resolution: This behavior is not present when PyTorch is built against CUDA 11.1, installed with the following specs:

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
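
To confirm which build is actually active in a given environment, the CUDA version PyTorch was compiled against can be checked at runtime:

import torch

print(torch.__version__)   # e.g. 1.8.0
print(torch.version.cuda)  # '10.2' reproduces the bug here, '11.1' does not
print(torch.cuda.is_available())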

Could you update PyTorch to the latest stable release and check whether you still see this issue, please?

Thanks. Yes, I see the same behaviour with the latest stable release (PyTorch 1.9.0) when it is installed with CUDA 10.2. When it is installed with CUDA 11.1, the problem goes away, as noted in the original post. I followed the conda specs here: https://pytorch.org/