Hello,
I am experimenting with a pretrained model (Qwen3VL from Hugging Face). I noticed that the embedding layer is exceptionally slow when processing image features, while the later attention layers run efficiently.
The model’s visual embedding layer uses nn.Conv3d to perform the linear patch embedding:
import torch
import torch.nn as nn

class Qwen3VLVisionPatchEmbed(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        self.patch_size = config.patch_size
        self.temporal_patch_size = config.temporal_patch_size
        self.in_channels = config.in_channels
        self.embed_dim = config.hidden_size
        kernel_size = [self.temporal_patch_size, self.patch_size, self.patch_size]
        self.proj = nn.Conv3d(self.in_channels, self.embed_dim, kernel_size=kernel_size, stride=kernel_size, bias=True)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        target_dtype = self.proj.weight.dtype
        # (num_patches, C * T * P * P) -> (num_patches, C, T, P, P)
        hidden_states = hidden_states.view(
            -1, self.in_channels, self.temporal_patch_size, self.patch_size, self.patch_size
        )
        hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
        return hidden_states
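Since the stride equals the kernel size here, this Conv3d is mathematically just a linear projection of each flattened patch, which is why comparing it against nn.Linear is fair. A quick equivalence check (my own sketch, not from the Qwen3VL source; shapes chosen to match the benchmark below):

```python
import torch
import torch.nn as nn

# A Conv3d whose stride equals its kernel size applies one dot product per
# non-overlapping patch, i.e. it is a Linear over the flattened patch.
C, T, H, W, D = 3, 2, 16, 16, 64
conv = nn.Conv3d(C, D, kernel_size=(T, H, W), stride=(T, H, W), bias=True)
linear = nn.Linear(C * T * H * W, D, bias=True)

# Conv3d weight has shape (D, C, T, H, W); flattening the trailing four
# dims gives exactly the (D, C*T*H*W) Linear weight.
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(D, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(8, C, T, H, W)
out_conv = conv(x).view(-1, D)       # (8, D, 1, 1, 1) -> (8, D)
out_linear = linear(x.view(8, -1))   # flatten each patch, then project
print(torch.allclose(out_conv, out_linear, atol=1e-4))
```

So swapping the projection for an nn.Linear should be numerically safe; the question is why the Conv3d path is so much slower in reduced precision.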
After some digging, I found that nn.Conv3d can be significantly slower than nn.Linear under fp16 or bf16, while both perform normally under fp32.
To verify, I ran a minimal benchmark:
import torch
import torch.nn as nn
import time
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
device = torch.device("cuda")
N = 10240
C, T, H, W = 3, 2, 16, 16
embed_dim = 1024
dtype = torch.bfloat16
x = torch.randn(N, C, T, H, W, device=device, dtype=dtype)
conv = nn.Conv3d(C, embed_dim, kernel_size=(T, H, W), stride=(T, H, W), bias=True).to(device, dtype=dtype)
linear = nn.Linear(C*T*H*W, embed_dim, bias=True).to(device, dtype=dtype)
# timing conv
torch.cuda.synchronize()
t0 = time.time()
_ = conv(x)
torch.cuda.synchronize()
t1 = time.time()
# timing linear (with its own start timestamp, so the conv timing is untouched)
torch.cuda.synchronize()
t2 = time.time()
_ = linear(x.view(N, -1))
torch.cuda.synchronize()
t3 = time.time()
print("Conv3d:", t1 - t0, "s")
print("Linear:", t3 - t2, "s")
With dtype = torch.bfloat16, the output is:
torch version: 2.9.0+cu128
CUDA available: True
cuDNN available: True
cuDNN version: 91002
Conv3d: 42.12006068229675 s
Linear: 0.0008780956268310547 s
With dtype = torch.float16:
Conv3d: 42.59985375404358 s
Linear: 0.0016818046569824219 s
With dtype = torch.float32:
Conv3d: 0.06699419021606445 s
Linear: 0.04182004928588867 s
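One caveat I am aware of: the script above times a single call, so the Conv3d number could in principle include one-time cuDNN algorithm selection. A variant with warmup iterations and torch.cuda.Event timing (a sketch; bench is my own helper, not a library function, and the iteration counts are kept small because the bf16 Conv3d call is so slow here):

```python
import torch
import torch.nn as nn

def bench(fn, x, iters=5, warmup=2):
    """Average milliseconds per call of fn(x), excluding warmup iterations."""
    for _ in range(warmup):
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time returns ms

if torch.cuda.is_available():
    # Same shapes and dtype as the minimal benchmark above.
    N, C, T, H, W, D = 10240, 3, 2, 16, 16, 1024
    dtype = torch.bfloat16
    x = torch.randn(N, C, T, H, W, device="cuda", dtype=dtype)
    conv = nn.Conv3d(C, D, kernel_size=(T, H, W), stride=(T, H, W), bias=True).to("cuda", dtype)
    linear = nn.Linear(C * T * H * W, D, bias=True).to("cuda", dtype)
    print("Conv3d:", bench(conv, x), "ms")
    print("Linear:", bench(lambda t: linear(t.view(N, -1)), x), "ms")
```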
Local environment:
- GPU: NVIDIA RTX 6000 Blackwell
- Driver: 570.195.03
- CUDA: 12.8
- Torch: 2.9.0+cu128
- cuDNN: 91002
Am I missing anything needed to run this properly? Any guidance would be greatly appreciated!