RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)` while running fine on the CPU

@albanD Sure, I can test it.

The code runs fine on a V100-DGXS-16GB (driver 440.33.01) and on a V100-SXM3-32GB (driver 450.51.06), using the conda PyTorch binaries for 1.7.0 and 1.7.1 with the CUDA 10.2 runtime.
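
To compare setups, a minimal sketch along these lines could be run on the failing machine. It is not the script from this issue, and `collect_env` is a hypothetical helper name. It prints the PyTorch version, the CUDA runtime PyTorch was built against, and the device name, then runs a small matmul on the GPU, which should go through cuBLAS and so exercise handle creation. The driver version is not read here; `nvidia-smi` reports it.

```python
def collect_env():
    """Collect the PyTorch/CUDA details relevant to this report (hypothetical helper)."""
    info = {
        "torch_version": None,
        "cuda_runtime": None,
        "cuda_available": False,
        "device_name": None,
        "matmul_ok": None,  # stays None if no GPU test was run
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; return the empty report.
        return info

    info["torch_version"] = torch.__version__
    info["cuda_runtime"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()

    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        try:
            # A small GPU matmul; .item() forces synchronization so that
            # an asynchronous CUDA error surfaces inside this try block.
            a = torch.randn(8, 8, device="cuda")
            (a @ a).sum().item()
            info["matmul_ok"] = True
        except RuntimeError as exc:
            info["matmul_ok"] = False
            info["error"] = str(exc)
    return info


info = collect_env()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the matmul fails here as well, the problem is likely in the environment (driver, free GPU memory, or binary/GPU compatibility) and not in the model code. `python -m torch.utils.collect_env` gives a fuller report.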