Error we get:
We get an illegal memory access error during the forward pass of a linear layer (the relativeposeregressor module). When the code is run with CUDA_LAUNCH_BLOCKING=1, the following error is reported:
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
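For anyone trying to reproduce the trace above, here is a minimal sketch of setting CUDA_LAUNCH_BLOCKING from inside the script instead of on the command line. The helper name enable_blocking_launches is made up for this example. The variable has to be set before the process makes its first CUDA call, so the safest place is at the very top of the script, before torch is imported.

```python
import os


def enable_blocking_launches():
    """Ask the CUDA runtime to launch kernels synchronously (debugging aid).

    Hypothetical helper for illustration. It must run before the first CUDA
    call in the process, so call it before importing torch or moving any
    tensor to the GPU.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


value = enable_blocking_launches()
print("CUDA_LAUNCH_BLOCKING =", value)
# import torch  # import torch only after the variable is set
```

With synchronous launches the Python traceback points at the call that actually failed, rather than at a later, unrelated line.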
Environment Details:
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: NVIDIA TITAN RTX
Nvidia driver version: 465.19.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.7.1
[pip3] torchaudio==0.7.0a0+a853dff
[pip3] torchvision==0.8.2
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.1.243 h6bb024c_0
[conda] libblas 3.9.0 9_mkl conda-forge
[conda] libcblas 3.9.0 9_mkl conda-forge
[conda] liblapack 3.9.0 9_mkl conda-forge
[conda] mkl 2021.3.0 h06a4308_520
[conda] mkl-service 2.4.0 py38h7f8727e_0
[conda] mkl_fft 1.3.0 py38h42c9631_2
[conda] mkl_random 1.2.2 py38h51133e4_0
[conda] numpy 1.20.3 py38hf144106_0
[conda] numpy-base 1.20.3 py38h74d4b33_0
[conda] pytorch 1.7.1 py3.8_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] torchaudio 0.7.2 py38 pytorch
[conda] torchvision 0.8.2 py38_cu101 pytorch
Bug Explained: With the environment above and a single GPU, the linear layer appears to produce a very large (possibly erroneous) value that overflows during the matrix multiplication. This happens only when the batch size, i.e. the 0th dimension of the input tensor to the linear layer, is less than 9. The error does not appear on CPU, and it does not appear on GPU when the batch size is higher than 9.
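As a side note on scale, the arithmetic below works out how large the first linear layer of the regressor is, using the sizes from the repro code (512 channels, a 30x40 feature map, 4096 hidden units). The constant names are mine. The only thing it shows is that the weight matrix holds more elements than a signed 32-bit integer can index; whether that is related to the failure is an assumption, not something established in this post.

```python
# Sizes taken from the repro code: nn.Linear(encoder_dim*30*40, 4096)
ENCODER_DIM = 512
FEATURE_H, FEATURE_W = 30, 40
HIDDEN = 4096
INT32_MAX = 2**31 - 1

in_features = ENCODER_DIM * FEATURE_H * FEATURE_W  # flattened encoder output
weight_elems = in_features * HIDDEN                # elements in the weight matrix
weight_gib = weight_elems * 4 / 2**30              # float32 storage in GiB
exceeds_int32 = weight_elems > INT32_MAX

print("in_features  :", in_features)
print("weight_elems :", weight_elems)
print("weight size  : %.3f GiB" % weight_gib)
print("exceeds int32:", exceeds_int32)
```

So the first layer alone needs roughly 9.4 GiB in float32, and its element count is past the int32 limit.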
Reproduce Bug:
Install the conda PyTorch version using these specs:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
Then run the following code snippet, setting THE_TRICKY_PARAM to values above and below 9 to see the difference.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import models

THE_TRICKY_PARAM = 1  # When this is less than 9 we get an illegal memory access error


def train(epoch):
    model.train()
    input = torch.ones([THE_TRICKY_PARAM, 3, 480, 640], dtype=torch.float)
    input = input.to(device)
    image_encoding = model.encoder(input)
    print(image_encoding.shape)
    # The error is raised here, in the first linear layer of the regressor
    relativeposeregressed = model.relativeposeregressor(image_encoding)
    print(relativeposeregressed)
    print(relativeposeregressed.shape)
    # Note: these split sizes only make sense for THE_TRICKY_PARAM >= 2
    rposQ, rposP, rposN = torch.split(relativeposeregressed, [1, 1, THE_TRICKY_PARAM - 2])
    print(rposQ)
    print(rposP)
    print(rposN)
    optimizer.zero_grad()


class Flatten(nn.Module):
    def forward(self, input):
        # Collapse everything except the batch dimension
        return input.view(input.size(0), -1)


encoder_dim = 512
encoder = models.vgg16(pretrained=True)
# Drop the last two layers of the VGG16 feature extractor
layers = list(encoder.features.children())[:-2]
# Freeze all but the last five remaining layers
for l in layers[:-5]:
    for p in l.parameters():
        p.requires_grad = False
encoder = nn.Sequential(*layers)

model = nn.Module()
model.add_module('encoder', encoder)
relativeposeregressor = nn.Sequential(*[
    Flatten(),
    nn.Linear(encoder_dim * 30 * 40, 4096),
    nn.ReLU(),
    nn.Linear(4096, 2),
    nn.ReLU(),
])
model.add_module('relativeposeregressor', relativeposeregressor)

cuda = True
device = torch.device("cuda")
model = model.to(device)

# Only optimize the parameters that were not frozen above
optimizer = optim.SGD(filter(lambda p: p.requires_grad, model.parameters()),
                      lr=0.0001,
                      momentum=0.9,
                      weight_decay=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

epoch = 1
scheduler.step(epoch)
train(epoch)
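To narrow things down without the VGG16 encoder, a batch-size sweep over a single nn.Linear layer can help. This is only a sketch: sweep_batch_sizes is a hypothetical helper, and the layer in the example call is deliberately tiny so it runs anywhere (it falls back to CPU when no GPU is present). To mimic the report, pass in_features=512*30*40 and out_features=4096, which needs about 10 GB of GPU memory. I have not confirmed that the failure reproduces with the linear layer on its own.

```python
import torch
import torch.nn as nn


def sweep_batch_sizes(in_features, out_features, batch_sizes, device):
    """Run one nn.Linear forward pass per batch size.

    Returns a dict mapping batch size to True/False (all outputs finite)
    or to an error string if the forward pass raised a RuntimeError.
    """
    torch.manual_seed(0)
    layer = nn.Linear(in_features, out_features).to(device)
    results = {}
    with torch.no_grad():
        for b in batch_sizes:
            x = torch.ones(b, in_features, device=device)
            try:
                y = layer(x)
                if device.type == "cuda":
                    # Force pending kernels to finish so errors surface here
                    torch.cuda.synchronize()
                results[b] = bool(torch.isfinite(y).all())
            except RuntimeError as err:
                results[b] = "error: %s" % err
    return results


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Tiny layer for a quick check; swap in 512*30*40 and 4096 to match the report
results = sweep_batch_sizes(64, 8, range(1, 17), device)
for b, outcome in results.items():
    print(b, outcome)
```

Keep in mind that once an illegal memory access has occurred, later CUDA calls in the same process usually fail as well, so the first failing batch size is the informative one. Running each batch size in a fresh process gives cleaner results.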
Resolution: This behavior is not present in the CUDA 11.1 version installed using the following specs:
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge