CUDA error: CUBLAS_STATUS_EXECUTION_FAILED

I ran into a strange problem while compiling PyTorch from source to support the MPI backend.

My setup:
4× RTX 2080 Ti GPUs
PyTorch 1.6.0
CUDA 10.1

The problem happens when I try to compile PyTorch 1.6.0 from source.
I followed the steps from the PyTorch repository (GitHub - pytorch/pytorch: Tensors and Dynamic neural networks in Python with strong GPU acceleration).
The compilation went smoothly, and afterwards I was able to see all 4 GPUs on my machine.

But when I tested the build with a simple script that I have been using for a while, I got the following error message.

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
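
For reference, a plain float32 matmul on the GPU goes through this cublasSgemm path, so a minimal snippet along these lines should be enough to hit the same call (a sketch with arbitrary sizes, not my actual script):

import torch

# a float32 matrix multiply on CUDA dispatches to cublasSgemm,
# the call that fails in my source build
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = torch.mm(a, b)
torch.cuda.synchronize()  # surface any asynchronous CUDA error here
print(c.sum().item())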

The same code runs perfectly on the same machine when I use the PyTorch build installed directly from pip.
Can anyone give some advice on how to fix this problem?

Below is the script that I used to compile PyTorch from source.

# download, build, and install OpenMPI 3.0.0 with CUDA support into $HOME/.openmpi
mkdir $HOME/.openmpi/
wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz
gunzip -c openmpi-3.0.0.tar.gz | tar xf - \
    && cd openmpi-3.0.0 \
    && ./configure --prefix=$HOME/.openmpi/ --with-cuda=/usr/local/cuda-10.1 \
    && make all install
# put the new OpenMPI first on the search paths so the PyTorch build finds it
export PATH=$HOME/.openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.openmpi/lib:$LD_LIBRARY_PATH
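
To double-check that this OpenMPI build actually picked up CUDA support (the --with-cuda flag above), its MCA parameters can be queried; a small sketch, assuming the freshly installed ompi_info is first on the PATH:

import subprocess

# "mpi_built_with_cuda_support" is true only when OpenMPI was configured --with-cuda
out = subprocess.run(["ompi_info", "--parsable", "--all"],
                     capture_output=True, text=True).stdout
print([line for line in out.splitlines()
       if "mpi_built_with_cuda_support" in line])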

######### add cuda directories

# point the build at the system CUDA 10.1 install (cuDNN lives in the same tree)
export CUDA_NVCC_EXECUTABLE="/usr/local/cuda-10.1/bin/nvcc"
export CUDA_HOME="/usr/local/cuda-10.1/"
export CUDNN_INCLUDE_PATH="/usr/local/cuda-10.1/include/"
export CUDNN_LIBRARY_PATH="/usr/local/cuda-10.1/lib64/"
export LIBRARY_PATH="/usr/local/cuda-10.1/lib64"
# enable CUDA, cuDNN, and MKL-DNN; cap the number of parallel build jobs
export USE_CUDA=1 USE_CUDNN=1 USE_MKLDNN=1 MAX_JOBS=80


# create a clean conda environment for the build
export ENV_NAME=pt16
conda create --name $ENV_NAME python=3.7.3 numpy=1.16.3
conda activate $ENV_NAME
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing_extensions

# use magma-cuda101 to match the system CUDA version
conda install -c pytorch magma-cuda101
export CMAKE_PREFIX_PATH="$HOME/anaconda3/envs/$ENV_NAME"
# work around conda's compiler_compat linker shadowing the system ld during the build
cd ~/anaconda3/envs/$ENV_NAME/compiler_compat
mv ld ld-old
sudo apt-get install libomp-dev
cd ~
git clone --recursive https://github.com/pytorch/pytorch pytorch && \
cd pytorch && \
git checkout v1.6.0 --recurse-submodules && \
python setup.py install 2>&1 | tee compile.log
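
Once the install finishes, whether the build picked up MPI (and all four GPUs) can be verified from Python; a quick sketch:

import torch
import torch.distributed as dist

print(torch.__version__)          # should report the 1.6.0 checkout
print(torch.cuda.device_count())  # should report 4 on this machine
print(dist.is_mpi_available())    # True only if the build found OpenMPI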

I don’t know which pip wheel you’ve installed, but you could try to rebuild PyTorch against the same cublas version and check whether you are seeing an already-fixed issue.
Also, make sure that you are not running out of memory, as cublas might raise this unhelpful error message if it’s unable to allocate memory internally.
In any case, I would also recommend sticking to the latest PyTorch release (1.8.1), as it ships with the latest bug fixes (you might also be hitting an internal PyTorch issue that was fixed after the 1.6.0 release).
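
To rule out the memory explanation, you could print the caching-allocator stats right before the failing matmul; a minimal sketch (the device index is an assumption):

import torch

dev = torch.device("cuda:0")  # assumes the failing op runs on GPU 0
# if the reserved number is close to the 11 GB of a 2080 Ti, suspect an internal OOM
print(torch.cuda.memory_allocated(dev))  # bytes held by live tensors
print(torch.cuda.memory_reserved(dev))   # bytes reserved by the caching allocator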

Hi @ptrblck,

Thanks for your response.
I tried to build PyTorch from source at several versions.
I have tested 1.6, 1.7, and the latest one, 1.9, but all of them hit the same cublas issue.
Memory is not running out when I test my build, because I can run the same code on the same machine after switching to the PyTorch version installed directly from pip.
I believe I have set all the CUDA-related environment variables to correctly reflect my system setup; the configure step at the start of the build picked up the CUDA paths I expected.

Thanks for the update. Which cublas version are you using in your local installation?

It should be cublas 10.1, because I specifically set all the environment variables to point to my CUDA 10.1 path.
Is there a way I can check?

Could you update to cublas 10.2.2, which ships with CUDA 10.2?
That CUDA toolkit (with its cublas version) is the one used in the binaries, so since you are building against an older cublas release, you might be hitting an already-fixed issue.
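
To answer the check question: PyTorch records the CUDA toolkit (and thus the cublas version) it was compiled against, so a quick sketch is:

import torch

print(torch.version.cuda)       # CUDA toolkit the build was compiled against
print(torch.__config__.show())  # full build config, incl. CUDA and cuDNN versions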