Error we get:
We get an illegal memory access error during the forward pass of a linear layer (the relativeposeregressor module). When the code is run with CUDA_LAUNCH_BLOCKING=1, we receive the following error:
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
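For reference, a minimal sketch of setting the flag from inside Python rather than the shell; the only requirement is that it is set before the first CUDA call, otherwise kernel launches stay asynchronous and the trace can point at the wrong op:

import os

# Must be set before any CUDA work; with it, kernel launches run
# synchronously and the Python stack trace points at the failing op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch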
Environment Details:
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA TITAN RTX
Nvidia driver version: 465.19.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.1.243             h6bb024c_0  
[conda] libblas                   3.9.0                     9_mkl    conda-forge
[conda] libcblas                  3.9.0                     9_mkl    conda-forge
[conda] liblapack                 3.9.0                     9_mkl    conda-forge
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] mkl-service               2.4.0            py38h7f8727e_0  
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.2            py38h51133e4_0  
[conda] numpy                     1.20.3           py38hf144106_0  
[conda] numpy-base                1.20.3           py38h74d4b33_0  
[conda] pytorch                   1.7.1           py3.8_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] torchaudio                0.7.2                      py38    pytorch
[conda] torchvision               0.8.2                py38_cu101    pytorch
Bug Explained: From what we can tell, given the environment details above and a single GPU, a PyTorch linear layer produces a very large (potentially erroneous) value that overflows during the matrix multiplication, but only when the batch size, i.e. the 0th dimension of the input tensor to the linear layer, is less than 9. The bug does not appear on the CPU, and it does not appear on the GPU when the batch size is 9 or greater.
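To isolate the layer, here is a minimal sketch along these lines; the sizes mirror the first regressor layer in the full repro below (they are the only assumption here), and the roughly 10 GB weight matrix needs a large-memory GPU such as the TITAN RTX above, so scale down if memory is tight:

import copy
import torch
import torch.nn as nn

# Same shape as the first layer of the regressor in the full repro.
layer_cpu = nn.Linear(512 * 30 * 40, 4096)
layer_gpu = copy.deepcopy(layer_cpu).to("cuda")  # Module.to moves in place, so copy first

for batch_size in (1, 8, 9, 16):
    x = torch.ones(batch_size, 512 * 30 * 40)
    with torch.no_grad():
        cpu_out = layer_cpu(x)
        gpu_out = layer_gpu(x.to("cuda")).cpu()
    # On a healthy build both results agree up to float tolerance;
    # on the affected build the small batch sizes crash or diverge.
    print(batch_size, torch.allclose(cpu_out, gpu_out, atol=1e-4))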
Reproduce Bug:
Install the conda PyTorch version using these specs (the error also reproduces with the PyTorch 1.7.1 / CUDA 10.1 environment listed above):
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
Then use the following code snippet, changing the value of THE_TRICKY_PARAM to values below and above 9 to make the error appear and disappear.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

THE_TRICKY_PARAM = 1  # When this is less than 9 we get an illegal memory access error
def train(epoch):
    model.train()
    
    input = torch.ones([THE_TRICKY_PARAM, 3, 480, 640], dtype=torch.float)
    input = input.to(device)
    image_encoding = model.encoder(input)
    print(image_encoding.shape)
    relativeposeregressed = model.relativeposeregressor(image_encoding)
    print(relativeposeregressed)
    print(relativeposeregressed.shape)
    
    # Split the batch into query/positive/negative poses; note the split
    # sizes are only valid when THE_TRICKY_PARAM >= 2.
    rposQ, rposP, rposN = torch.split(relativeposeregressed, [1, 1, THE_TRICKY_PARAM - 2])
    
    print(rposQ)
    print(rposP)
    print(rposN)
    optimizer.zero_grad()
class Flatten(nn.Module):
    # Flatten each sample to a 1-D vector so it can feed the linear layers.
    def forward(self, input):
        return input.view(input.size(0), -1)
    
encoder_dim = 512
encoder = models.vgg16(pretrained=True)
layers = list(encoder.features.children())[:-2]  # drop the final ReLU and max-pool
for l in layers[:-5]:  # freeze all but the last few layers
    for p in l.parameters():
        p.requires_grad = False
encoder = nn.Sequential(*layers)
model = nn.Module()  # bare container module holding the encoder and regressor
model.add_module('encoder', encoder)        
relativeposeregressor = nn.Sequential(
    Flatten(),
    nn.Linear(encoder_dim * 30 * 40, 4096),  # a 480x640 input yields a 512x30x40 feature map
    nn.ReLU(),
    nn.Linear(4096, 2),
    nn.ReLU())
model.add_module('relativeposeregressor', relativeposeregressor)
cuda = True
device = torch.device("cuda")
model = model.to(device)
optimizer = optim.SGD(filter(lambda p: p.requires_grad, 
    model.parameters()), lr=0.0001,
    momentum=0.9,
    weight_decay=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
epoch = 1
scheduler.step(epoch)
train(epoch)
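For a CPU-only sanity check (our addition, not part of the original script), replacing train(epoch) with the following forward pass completes for any value of THE_TRICKY_PARAM, which is how we confirmed the error is GPU-specific:

# Hypothetical CPU variant of the forward pass above.
model.to("cpu")  # Module.to moves the model in place
with torch.no_grad():
    x = torch.ones([THE_TRICKY_PARAM, 3, 480, 640], dtype=torch.float)
    out = model.relativeposeregressor(model.encoder(x))
print(out.shape)  # torch.Size([THE_TRICKY_PARAM, 2])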
Resolution: This behavior is not present with the CUDA 11.1 build of PyTorch, installed using the following specs:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
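After reinstalling, the active build can be verified with standard torch attributes:

import torch

print(torch.__version__)              # expect 1.8.0
print(torch.version.cuda)             # expect 11.1 after the reinstall
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA TITAN RTX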