RuntimeError: CUDA error: misaligned address

I am trying to run inference with the florence-2-large-ft model. However, I get this error when running the code on GPU, while the same code runs without errors on CPU. What is the reason for this behavior, and how can I solve it?

(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e# python sample_florence_large.py 
Using cpu with torch.float16
Loading model...
/opt/conda/envs/llm-planner/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Florence2LanguageForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Model loaded successfully
Inputs processed
/opt/conda/envs/llm-planner/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `num_beams` is set to 1. However, `early_stopping` is set to `True` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `early_stopping`.
  warnings.warn(
Generation completed
{'<OD>': {'bboxes': [[34.880001068115234, 160.55999755859375, 597.4400024414062, 371.7599792480469], [453.44000244140625, 276.7200012207031, 553.9199829101562, 370.79998779296875], [93.75999450683594, 280.55999755859375, 197.44000244140625, 371.2799987792969]], 'labels': ['car', 'wheel', 'wheel']}}

-------------------------------------------------------------------------- 

(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e# python sample_florence_large.py 
Total GPU memory: 47.7 GB
Using CUDA with torch.float16
Loading model...
/opt/conda/envs/llm-planner/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Florence2LanguageForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Model loaded successfully
Inputs processed
Error during model execution: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Traceback (most recent call last):
  File "sample_florence_large.py", line 71, in <module>
    generated_ids = model.generate(
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 2793, in generate
    image_features = self._encode_image(pixel_values)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 2603, in _encode_image
    x = self.vision_tower.forward_features_unpool(pixel_values)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 647, in forward_features_unpool
    x, input_size = block(x, input_size)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 206, in forward
    inputs = module(*inputs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 206, in forward
    inputs = module(*inputs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 493, in forward
    x, size = self.window_attn(x, size)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 222, in forward
    x, size = self.fn(self.norm(x), *args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 444, in forward
    qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 117, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Traceback (most recent call last):
  File "sample_florence_large.py", line 93, in <module>
    torch.cuda.empty_cache()
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/cuda/memory.py", line 170, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


--------------------------------------------------------------------------

(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e# CUDA_LAUNCH_BLOCKING=1 & python sample_florence_large.py
[1] 2751
Total GPU memory: 47.7 GB
Using CUDA with torch.float16
Loading model...
/opt/conda/envs/llm-planner/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Florence2LanguageForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
Model loaded successfully
Inputs processed
Error during model execution: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "sample_florence_large.py", line 71, in <module>
    generated_ids = model.generate(
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 2793, in generate
    image_features = self._encode_image(pixel_values)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 2603, in _encode_image
    x = self.vision_tower.forward_features_unpool(pixel_values)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 647, in forward_features_unpool
    x, input_size = block(x, input_size)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 206, in forward
    inputs = module(*inputs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 206, in forward
    inputs = module(*inputs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 497, in forward
    x, size = self.ffn(x, size)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 222, in forward
    x, size = self.fn(self.norm(x), *args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 252, in forward
    return self.net(x), size
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/container.py", line 219, in forward
    input = module(input)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 705, in forward
    return F.gelu(input, approximate=self.approximate)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "sample_florence_large.py", line 93, in <module>
    torch.cuda.empty_cache()
  File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/cuda/memory.py", line 170, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[1]+  Done                    CUDA_LAUNCH_BLOCKING=1

(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e# 
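Note: judging by the `[1]+  Done                    CUDA_LAUNCH_BLOCKING=1` line above, I believe the `&` ran the variable assignment as a separate background job rather than exporting it to the Python process, so the third run may not actually have had blocking launches enabled. Setting it inline without the `&` (`CUDA_LAUNCH_BLOCKING=1 python sample_florence_large.py`) should apply it; alternatively, as a sketch, it can be set from Python before torch initializes CUDA:

import os

# Must be set before the first CUDA call (safest: before importing torch),
# otherwise the setting has no effect on this process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch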

I use this Python file `sample_florence_large.py` for testing.

# from https://huggingface.co/microsoft/Florence-2-large-ft
import gc

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

def setup_device_and_dtype():
    """Setup device and dtype with proper error handling"""
    device = "cuda" # I CHANGE HERE
    try:
        if device == "cuda":
            # Clear memory first
            torch.cuda.empty_cache()
            gc.collect()
            
            # Check available memory
            total_memory = torch.cuda.get_device_properties(0).total_memory
            print(f"Total GPU memory: {total_memory / 1e9:.1f} GB")
            
            # Test basic CUDA operation
            test_tensor = torch.ones(1, device="cuda")
            del test_tensor
            torch.cuda.empty_cache()
            
            # Use float16 for half-precision inference on the GPU
            torch_dtype = torch.float16
            print(f"Using CUDA with {torch_dtype}")
            return device, torch_dtype
            
        elif device == "cpu":
            torch_dtype = torch.float16
            print(f"Using cpu with {torch_dtype}")
            return device, torch_dtype

    except Exception as e:
        print(f"CUDA setup failed: {e}")
        print("Falling back to CPU")
        return "cpu", torch.float16

device, torch_dtype = setup_device_and_dtype()

try:
    # Load model with error handling
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-base-ft",
        torch_dtype=torch_dtype,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        device_map=device
    ).to(device)  # device_map already places the model, so this .to() is redundant
    
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
    print("Model loaded successfully")
    
    # Prompt and example image from the Florence-2 model card
    prompt = "<OD>"
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
    image = Image.open(requests.get(url, stream=True).raw)
    
    with torch.no_grad():
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
        print("Inputs processed")
        
        # Clear cache before generation
        if device == "cuda":
            torch.cuda.empty_cache()
            
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=512,  # cap the number of generated tokens
            do_sample=False,
            num_beams=1  # greedy decoding (no beam search)
        )
        
        print("Generation completed")
    
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
    print(parsed_answer)
    
except Exception as e:
    print(f"Error during model execution: {e}")
    import traceback
    traceback.print_exc()
    
finally:
    # Clean up
    if device == "cuda":
        torch.cuda.empty_cache()
    gc.collect()

And the GPU details are as follows:

Fri Jun  6 07:53:51 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    Off |   00000000:01:00.0 Off |                    0 |
| 30%   36C    P8             22W /  300W |     149MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e# 

How can I solve this unexpected behavior? Thank you in advance for your time!

You might be running out of memory (OOM) on your device, which would cause the creation of the cuBLAS handle to fail. Could you try reducing the batch size (or another parameter, e.g. `max_new_tokens`) to lower the memory usage?
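If it helps to confirm whether it is memory-related, a minimal check (assuming a single-GPU setup, device 0) is to print the driver-reported free memory right before the `generate` call:

import torch

# Free vs. total device memory as reported by the CUDA driver (device 0 assumed).
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"Free: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")

# Memory held by live tensors vs. memory reserved by PyTorch's caching allocator.
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")

If the free number is still large right before the failure, the handle creation is probably failing for some other reason; the idle `nvidia-smi` reading alone doesn't settle it, since the model and activations are only allocated at runtime.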