Running Llama-3.3-70B-Instruct: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am trying to run a 4-bit quantised version of Llama-3.3-70B-Instruct on my NVIDIA RTX 6000 Ada GPU, and I get the following error:

  File "/mnt/c/Users/user1/myenv/lib/python3.12/site-packages/bitsandbytes/autograd/_functions.py", line 462, in forward
    output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
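
To check whether cuBLAS itself is the problem, I believe the same kind of call can be reproduced outside bitsandbytes with a plain float32 linear on the GPU (minimal sketch, arbitrary sizes):

import torch

# Arbitrary sizes; the point is only to run a float32 GEMM on the GPU,
# which should exercise the same cuBLAS path as the failing call.
a = torch.randn(4, 8192, device="cuda", dtype=torch.float32)
w = torch.randn(8192, 8192, device="cuda", dtype=torch.float32)
out = torch.nn.functional.linear(a, w)
print(out.shape)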

`nvcc --version` reports CUDA 12.0, while `nvidia-smi` reports driver 553.50 and CUDA 12.4. My understanding is that this mismatch is not a problem, since nvidia-smi only shows the highest CUDA version the driver supports, not the installed toolkit. VRAM is not being exceeded when the model is loaded.
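
For completeness, the CUDA version PyTorch itself was built against can be checked like this (it can differ from both the nvcc toolkit and the driver-reported 12.4):

import torch
print(torch.__version__)           # PyTorch build
print(torch.version.cuda)          # CUDA version the wheel was compiled against
print(torch.cuda.is_available())   # whether the driver is visible from WSL
print(torch.cuda.get_device_name(0))
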
Does anyone know what might be causing this? The full script is below.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-3.3-70B-Instruct"
token_content = "hf_XXXXXXXXXXXXXXXXXXXXX"
# 1. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    token=token_content,
    cache_dir="./huggingface_cache",
    use_fast=False  # Some Llama-based models do not have fast tokenizers yet
)

# 2. Load the model quantised to 4-bit via bitsandbytes
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=token_content,
    cache_dir="./huggingface_cache",
    load_in_4bit=True,
    device_map="auto",
    torch_dtype=torch.float16
)

prompt = "Explain the concept of Reinforcement Learning in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        top_p=0.95,
        temperature=0.7
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
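
One more note: I understand that recent transformers releases prefer passing a BitsAndBytesConfig rather than the bare load_in_4bit flag. If it matters, the loading step I would switch to looks roughly like this (sketch, not yet verified on my setup):

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # keep compute in fp16, matching torch_dtype above
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=token_content,
    cache_dir="./huggingface_cache",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16
)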