I am trying to run inference with the florence-2-large-ft model. The code runs without errors on CPU, but when I run it on GPU I get the CUDA error below. What is the reason for this behavior, and how can I solve it?
(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e# python sample_florence_large.py
Using cpu with torch.float16
Loading model...
/opt/conda/envs/llm-planner/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Florence2LanguageForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
- If you are not the owner of the model architecture class, please contact the model code owner to update it.
Model loaded successfully
Inputs processed
/opt/conda/envs/llm-planner/lib/python3.8/site-packages/transformers/generation/configuration_utils.py:638: UserWarning: `num_beams` is set to 1. However, `early_stopping` is set to `True` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `early_stopping`.
warnings.warn(
Generation completed
{'<OD>': {'bboxes': [[34.880001068115234, 160.55999755859375, 597.4400024414062, 371.7599792480469], [453.44000244140625, 276.7200012207031, 553.9199829101562, 370.79998779296875], [93.75999450683594, 280.55999755859375, 197.44000244140625, 371.2799987792969]], 'labels': ['car', 'wheel', 'wheel']}}
--------------------------------------------------------------------------
(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e# python sample_florence_large.py
Total GPU memory: 47.7 GB
Using CUDA with torch.float16
Loading model...
/opt/conda/envs/llm-planner/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Florence2LanguageForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
- If you are not the owner of the model architecture class, please contact the model code owner to update it.
Model loaded successfully
Inputs processed
Error during model execution: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Traceback (most recent call last):
File "sample_florence_large.py", line 71, in <module>
generated_ids = model.generate(
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 2793, in generate
image_features = self._encode_image(pixel_values)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 2603, in _encode_image
x = self.vision_tower.forward_features_unpool(pixel_values)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 647, in forward_features_unpool
x, input_size = block(x, input_size)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 206, in forward
inputs = module(*inputs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 206, in forward
inputs = module(*inputs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 493, in forward
x, size = self.window_attn(x, size)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 222, in forward
x, size = self.fn(self.norm(x), *args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 444, in forward
qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 117, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Traceback (most recent call last):
File "sample_florence_large.py", line 93, in <module>
torch.cuda.empty_cache()
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/cuda/memory.py", line 170, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
--------------------------------------------------------------------------
(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e# CUDA_LAUNCH_BLOCKING=1 & python sample_florence_large.py
[1] 2751
Total GPU memory: 47.7 GB
Using CUDA with torch.float16
Loading model...
/opt/conda/envs/llm-planner/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Florence2LanguageForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
- If you are not the owner of the model architecture class, please contact the model code owner to update it.
Model loaded successfully
Inputs processed
Error during model execution: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "sample_florence_large.py", line 71, in <module>
generated_ids = model.generate(
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 2793, in generate
image_features = self._encode_image(pixel_values)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 2603, in _encode_image
x = self.vision_tower.forward_features_unpool(pixel_values)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 647, in forward_features_unpool
x, input_size = block(x, input_size)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 206, in forward
inputs = module(*inputs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 206, in forward
inputs = module(*inputs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 497, in forward
x, size = self.ffn(x, size)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 222, in forward
x, size = self.fn(self.norm(x), *args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Florence-2-base-ft/9803f52844ec1ae5df004e6089262e9a23e527fd/modeling_florence2.py", line 252, in forward
return self.net(x), size
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/container.py", line 219, in forward
input = module(input)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 705, in forward
return F.gelu(input, approximate=self.approximate)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "sample_florence_large.py", line 93, in <module>
torch.cuda.empty_cache()
File "/opt/conda/envs/llm-planner/lib/python3.8/site-packages/torch/cuda/memory.py", line 170, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[1]+ Done CUDA_LAUNCH_BLOCKING=1
(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e#
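Looking at that last transcript again, I think the `&` launched `CUDA_LAUNCH_BLOCKING=1` as its own background job (hence the `[1]+ Done CUDA_LAUNCH_BLOCKING=1` line at the end) instead of setting the variable for the Python process, so the blocking run presumably needs to be invoked as:

CUDA_LAUNCH_BLOCKING=1 python sample_florence_large.py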
I am using the following Python file, `sample_florence_large.py`, for testing:
# from https://huggingface.co/microsoft/Florence-2-large-ft
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
import gc

def setup_device_and_dtype():
    """Set up device and dtype with proper error handling."""
    device = "cuda"  # I changed this here (it was "cpu" for the CPU run)
    try:
        if device == "cuda":
            # Clear memory first
            torch.cuda.empty_cache()
            gc.collect()
            # Check available memory
            total_memory = torch.cuda.get_device_properties(0).total_memory
            print(f"Total GPU memory: {total_memory / 1e9:.1f} GB")
            # Test basic CUDA operation
            test_tensor = torch.ones(1, device="cuda")
            del test_tensor
            torch.cuda.empty_cache()
            # Half precision to reduce memory usage
            torch_dtype = torch.float16
            print(f"Using CUDA with {torch_dtype}")
            return device, torch_dtype
        elif device == "cpu":
            torch_dtype = torch.float16
            print(f"Using cpu with {torch_dtype}")
            return device, torch_dtype
    except Exception as e:
        print(f"CUDA setup failed: {e}")
        print("Falling back to CPU")
        return "cpu", torch.float16

device, torch_dtype = setup_device_and_dtype()

try:
    # Load model with error handling
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Florence-2-base-ft",
        torch_dtype=torch_dtype,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        device_map=device
    ).to(device)
    processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
    print("Model loaded successfully")

    prompt = "<OD>"
    url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
    image = Image.open(requests.get(url, stream=True).raw)

    with torch.no_grad():
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, torch_dtype)
        print("Inputs processed")

        # Clear cache before generation
        if device == "cuda":
            torch.cuda.empty_cache()

        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=512,  # reduced token limit
            do_sample=False,
            num_beams=1          # greedy decoding instead of beam search
        )
        print("Generation completed")

        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
        print(parsed_answer)

except Exception as e:
    print(f"Error during model execution: {e}")
    import traceback
    traceback.print_exc()
finally:
    # Clean up
    if device == "cuda":
        torch.cuda.empty_cache()
        gc.collect()
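To narrow this down, I put together the following standalone sanity check (not part of the original script; the tensor shapes are arbitrary). Since my GPU traceback dies in `F.linear` inside the model's `qkv` projection, this should show whether a bare fp16 `F.linear` on CUDA reproduces the `CUBLAS_STATUS_NOT_INITIALIZED` error outside of Florence-2:

# Standalone fp16 cuBLAS sanity check (my own addition, not part of sample_florence_large.py).
# If this also fails with CUBLAS_STATUS_NOT_INITIALIZED, the problem is in the
# CUDA/cuBLAS environment rather than in the Florence-2 model code.
import torch
import torch.nn.functional as F

x = torch.randn(8, 64, device="cuda", dtype=torch.float16)    # arbitrary input batch
w = torch.randn(128, 64, device="cuda", dtype=torch.float16)  # arbitrary weight matrix
b = torch.randn(128, device="cuda", dtype=torch.float16)      # arbitrary bias

y = F.linear(x, w, b)     # fp16 cuBLAS GEMM, the same op my traceback fails on
torch.cuda.synchronize()  # force the kernel to complete so any error surfaces here
print("fp16 F.linear OK:", y.shape, y.dtype)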
The GPU details are as follows:
Fri Jun 6 07:53:51 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03 Driver Version: 575.51.03 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... Off | 00000000:01:00.0 Off | 0 |
| 30% 36C P8 22W / 300W | 149MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
(llm-planner) root@78b14ed0ea55:/home/2-6-2025/LLM-Planner/e2e#
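In case it matters, this small snippet (again my own addition, not in the script above) reports the PyTorch build and device capability. The driver reports CUDA 12.9 and the RTX 6000 Ada is compute capability 8.9 (sm_89), so I suspect a mismatch between the CUDA toolkit the PyTorch wheel was built against and the GPU could be relevant:

import torch

print("torch:", torch.__version__)                         # PyTorch version
print("built with CUDA:", torch.version.cuda)              # CUDA toolkit the wheel was built against
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))  # RTX 6000 Ada should report (8, 9)
print("cuDNN:", torch.backends.cudnn.version())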
How can I solve this unexpected behavior? Thank you in advance for your time!