Hello,
I am experimenting with a pretrained model (Qwen3VL from Hugging Face). I noticed that the embedding layer is exceptionally slow when processing image features, while the later attention layers run efficiently.
The model’s visual embedding layer uses nn.Conv3d to perform the linear patch embedding:
import torch
import torch.nn as nn

class Qwen3VLVisionPatchEmbed(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        self.patch_size = config.patch_size
        self.temporal_patch_size = config.temporal_patch_size
        self.in_channels = config.in_channels
        self.embed_dim = config.hidden_size
        kernel_size = [self.temporal_patch_size, self.patch_size, self.patch_size]
        self.proj = nn.Conv3d(self.in_channels, self.embed_dim, kernel_size=kernel_size, stride=kernel_size, bias=True)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        target_dtype = self.proj.weight.dtype
        # (num_patches, C * T * P * P) -> (num_patches, C, T, P, P)
        hidden_states = hidden_states.view(
            -1, self.in_channels, self.temporal_patch_size, self.patch_size, self.patch_size
        )
        hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
        return hidden_states
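Since the stride equals the kernel size here, this Conv3d is mathematically just a linear projection of each flattened patch, which is why comparing it against nn.Linear is fair. A quick equivalence check (my own sketch, not from the Qwen3VL source; shapes chosen to match the benchmark below):

```python
import torch
import torch.nn as nn

# A Conv3d whose stride equals its kernel size applies one dot product per
# non-overlapping patch, i.e. it is a Linear over the flattened patch.
C, T, H, W, D = 3, 2, 16, 16, 64
conv = nn.Conv3d(C, D, kernel_size=(T, H, W), stride=(T, H, W), bias=True)
linear = nn.Linear(C * T * H * W, D, bias=True)

# Conv3d weight has shape (D, C, T, H, W); flattening the trailing four
# dims gives exactly the (D, C*T*H*W) Linear weight.
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(D, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(8, C, T, H, W)
out_conv = conv(x).view(-1, D)       # (8, D, 1, 1, 1) -> (8, D)
out_linear = linear(x.view(8, -1))   # flatten each patch, then project
print(torch.allclose(out_conv, out_linear, atol=1e-4))
```

So swapping the projection for an nn.Linear should be numerically safe; the question is why the Conv3d path is so much slower in reduced precision.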
After some digging, I found that nn.Conv3d can be significantly slower than nn.Linear under fp16 or bf16, while both perform normally under fp32.
To verify, I ran a minimal benchmark:
import torch
import torch.nn as nn
import time
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
device = torch.device("cuda")
N = 10240
C, T, H, W = 3, 2, 16, 16
embed_dim = 1024
dtype = torch.bfloat16
x = torch.randn(N, C, T, H, W, device=device, dtype=dtype)
conv = nn.Conv3d(C, embed_dim, kernel_size=(T, H, W), stride=(T, H, W), bias=True).to(device, dtype=dtype)
linear = nn.Linear(C*T*H*W, embed_dim, bias=True).to(device, dtype=dtype)
# timing conv
torch.cuda.synchronize()
t0 = time.time()
_ = conv(x)
torch.cuda.synchronize()
t1 = time.time()
# timing linear (with its own start timestamp, so the conv timing is untouched)
torch.cuda.synchronize()
t2 = time.time()
_ = linear(x.view(N, -1))
torch.cuda.synchronize()
t3 = time.time()
print("Conv3d:", t1 - t0, "s")
print("Linear:", t3 - t2, "s")
With dtype = torch.bfloat16, the output is:
torch version: 2.9.0+cu128
CUDA available: True
cuDNN available: True
cuDNN version: 91002
Conv3d: 42.12006068229675 s
Linear: 0.0008780956268310547 s
With dtype = torch.float16:
Conv3d: 42.59985375404358 s
Linear: 0.0016818046569824219 s
With dtype = torch.float32:
Conv3d: 0.06699419021606445 s
Linear: 0.04182004928588867 s
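One caveat I am aware of: the script above times a single call, so the Conv3d number could in principle include one-time cuDNN algorithm selection. A variant with warmup iterations and torch.cuda.Event timing (a sketch; bench is my own helper, not a library function, and the iteration counts are kept small because the bf16 Conv3d call is so slow here):

```python
import torch
import torch.nn as nn

def bench(fn, x, iters=5, warmup=2):
    """Average milliseconds per call of fn(x), excluding warmup iterations."""
    for _ in range(warmup):
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time returns ms

if torch.cuda.is_available():
    # Same shapes and dtype as the minimal benchmark above.
    N, C, T, H, W, D = 10240, 3, 2, 16, 16, 1024
    dtype = torch.bfloat16
    x = torch.randn(N, C, T, H, W, device="cuda", dtype=dtype)
    conv = nn.Conv3d(C, D, kernel_size=(T, H, W), stride=(T, H, W), bias=True).to("cuda", dtype)
    linear = nn.Linear(C * T * H * W, D, bias=True).to("cuda", dtype)
    print("Conv3d:", bench(conv, x), "ms")
    print("Linear:", bench(lambda t: linear(t.view(N, -1)), x), "ms")
```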
Local environment:
- GPU: NVIDIA RTX 6000 Blackwell
- Driver: 570.195.03
- CUDA: 12.8
- Torch: 2.9.0+cu128
- cuDNN: 91002
Am I missing anything needed to run this properly? Any guidance would be greatly appreciated!