I am training a quantization-aware model to output an embedding of size (677, 1408) that is as similar as possible to the original embedding. For this, I am using CosineEmbeddingLoss as the loss function. When I run my training script on the CPU, it works fine and the loss is ~0.13 per batch. However, when I run the same training script on the GPU, the loss becomes ~1.04 per batch and this leads to the following error: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2.
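For reference, here is a minimal sketch of how the loss is used, with illustrative tensors standing in for my actual model outputs (the variable names are not my real code):

```python
import torch
import torch.nn as nn

criterion = nn.CosineEmbeddingLoss()

# Each row of the model output is compared against the corresponding row
# of the original embedding; a target of +1 asks for maximum similarity.
quantized_emb = torch.randn(677, 1408)   # output of the quantization-aware model
original_emb = torch.randn(677, 1408)    # original (reference) embedding
target = torch.ones(quantized_emb.size(0))

loss = criterion(quantized_emb, original_emb, target)
print(loss.item())
```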
How can this be resolved? And is the issue related to CosineEmbeddingLoss?
Sorry for the delayed reply. I found the issue. It actually has nothing to do with CosineEmbeddingLoss. The problem was a mismatch between the torch and torchvision versions and the CUDA version. When installing some libraries for the BLIP-2 model, the torch and torchvision versions got updated as well. I reinstalled some of these libraries with --no-deps during the pip install and now it is working fine.
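In case anyone hits the same error: the quickest way I found to spot the mismatch was to print the versions the environment actually ended up with, for example:

```python
import torch
import torchvision

print(torch.__version__)         # installed torch version
print(torchvision.__version__)   # installed torchvision version
print(torch.version.cuda)        # CUDA version this torch build was compiled against
print(torch.cuda.is_available()) # whether the GPU is usable with this build
```

If the torch build's CUDA version no longer matches the driver/toolkit on the machine, or torchvision no longer matches torch, reinstalling the matching pair and then adding the extra libraries with pip install <package> --no-deps keeps them from pulling in incompatible torch versions again.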