Conv3d tensor core utilisation

Hi,

I have a problem with the inference of models containing Conv3d layers on Nvidia (V100) and AMD (MI100) GPUs. The tensor cores do not seem to be utilised at all by this layer. For comparison, I ran the same test with a fully connected layer, and there I could verify tensor core utilisation using nvprof. Below is a minimal code example (speed.py):

import numpy as np
import torch
import torch.nn as nn

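# Variant 1: fully connected network; expects input of shape (number_elements, 7*7*7).
# Comment this class out when testing the convolutional variant.
class test_network(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(7*7*7, 128)
        self.relu = nn.ReLU()

    def forward(self, X):
        return self.relu(self.fc1(X))

# Variant 2: convolutional network; expects input of shape (number_elements, 1, 7, 7, 7).
# If both classes are left active, this definition overrides the one above.
class test_network(nn.Module):

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(3,3,3), padding=(1,0,0))
        self.relu = nn.ReLU()

    def forward(self, X):
        return self.relu(self.conv1(X))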

test_net = test_network()

number_elements = 1000000
number_loops = 50

dtype=torch.float16
device=("cuda" if torch.cuda.is_available() else "cpu")

print("Cuda is available:", torch.cuda.is_available())

test_net = test_net.to(dtype=dtype, device=device)


with torch.no_grad():
    for i in range(number_loops):
        torch.cuda.synchronize()
        test = torch.ones((number_elements,1,7,7,7), dtype=dtype, device=device)
        out = test_net(test)
        torch.cuda.synchronize()
        del test

To use this script, (un-)comment the model to be tested and adapt the input size: (number_elements, 1, 7, 7, 7) for the convolutional network or (number_elements, 7 * 7 * 7) for the fully connected network.
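To make the two input shapes (and the resulting output shapes) explicit, here is a small pure-Python sketch that does not need PyTorch. The helper names conv3d_output_shape and input_shape and the labels "conv" and "fc" are my own and not part of speed.py; the shape formula assumes a dilation of 1.

```python
def conv3d_output_shape(in_shape, out_channels, kernel_size, padding, stride=(1, 1, 1)):
    """Output shape of a Conv3d for an input of shape (N, C, D, H, W), dilation 1."""
    n, _, *spatial = in_shape
    out_spatial = tuple(
        (size + 2 * p - k) // s + 1
        for size, k, p, s in zip(spatial, kernel_size, padding, stride)
    )
    return (n, out_channels, *out_spatial)


def input_shape(model_kind, number_elements):
    """Input shape for the chosen variant; 'conv' and 'fc' are hypothetical labels."""
    if model_kind == "conv":
        return (number_elements, 1, 7, 7, 7)
    if model_kind == "fc":
        return (number_elements, 7 * 7 * 7)
    raise ValueError(f"unknown model kind: {model_kind!r}")


number_elements = 1000000

# Convolutional variant: same layer parameters as in speed.py.
conv_in = input_shape("conv", number_elements)
conv_out = conv3d_output_shape(
    conv_in, out_channels=8, kernel_size=(3, 3, 3), padding=(1, 0, 0)
)
print("conv:", conv_in, "->", conv_out)

# Fully connected variant: nn.Linear(7*7*7, 128) maps the last dimension to 128.
fc_in = input_shape("fc", number_elements)
fc_out = (number_elements, 128)
print("fc:  ", fc_in, "->", fc_out)
```

Because padding is only applied along the first spatial dimension, the convolution keeps that dimension at 7 and shrinks the other two to 5.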

I then measure the tensor core utilisation on the Nvidia machine using:
sudo -E /usr/local/cuda/bin/nvprof -m tensor_precision_fu_utilization /usr/bin/python3 speed.py

My setup: Nvidia V100 PCIe, CUDA driver version 12.2, latest PyTorch (pip) installation.

Below you can find a screenshot of nvprof for the Conv3d model. (Unfortunately, as a new user, I cannot embed more than one image in my post).

(Conv model)

I have not run an nvprof-like profiler on the AMD machine. However, I see similar throughput in terms of TFLOPs there, so I assume the problem has the same cause.
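For reference, this is roughly how a TFLOPs figure can be derived from a measured runtime for the two layers in speed.py. It is only a sketch: the helper names are my own, each multiply-add is counted as 2 FLOPs, bias and ReLU are ignored, and the elapsed time below is a placeholder, not a measurement.

```python
import math


def conv3d_flops(n, c_in, c_out, out_spatial, kernel_size):
    """Approximate FLOPs of one Conv3d forward pass (2 FLOPs per multiply-add)."""
    out_positions = math.prod(out_spatial)
    kernel_volume = math.prod(kernel_size)
    return 2 * n * c_out * out_positions * c_in * kernel_volume


def linear_flops(n, in_features, out_features):
    """Approximate FLOPs of one Linear forward pass (2 FLOPs per multiply-add)."""
    return 2 * n * in_features * out_features


def tflops(flops, seconds):
    """Throughput in TFLOP/s for a given FLOP count and runtime in seconds."""
    return flops / seconds / 1e12


number_elements = 1000000

# Conv3d(1 -> 8, kernel 3x3x3, padding (1,0,0)) on 7x7x7 inputs gives 7x5x5 outputs.
conv_total = conv3d_flops(
    number_elements, c_in=1, c_out=8, out_spatial=(7, 5, 5), kernel_size=(3, 3, 3)
)
fc_total = linear_flops(number_elements, in_features=7 * 7 * 7, out_features=128)

elapsed_seconds = 0.1  # placeholder value, replace with the measured time per forward pass
print("conv FLOPs per pass:", conv_total)
print("fc FLOPs per pass:  ", fc_total)
print("conv TFLOP/s at placeholder time:", tflops(conv_total, elapsed_seconds))
print("fc TFLOP/s at placeholder time:  ", tflops(fc_total, elapsed_seconds))
```

The time per forward pass would come from timing the region between the two torch.cuda.synchronize() calls in the loop.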

Any help is much appreciated.

Many thanks, kind regards,
Christian