PyTorch Versions CUDA Forward Time

Hi!
I’m working with the official DGCNN (Wang et al.) PyTorch implementation, but I’m encountering strange behaviour across different PyTorch versions (always using CUDA compilation tools V10.1.243).
For some reason the CUDA forward time of a batch jumps from milliseconds to seconds when moving to newer PyTorch versions (1.4 onwards). The tested code is exactly the same, and so is the CUDA toolkit. Any idea why this happens? Thanks!

PyTorch 1.3.1 Forward pass Profiling:
Self CPU time total: 6.875ms
CUDA time total: 264.906ms

PyTorch 1.4 Forward pass Profiling:
Self CPU time total: 27.976ms
CUDA time total: 1.498s

PyTorch 1.5 Forward pass Profiling:
Self CPU time total: 12.93ms
CUDA time total: 1.434s

(These are just time measurements for forwarding a batch through the model; the trend is that for newer versions of PyTorch (1.4 and later) the CUDA time total is in seconds, while for older versions (1.3.1 and earlier) it is in milliseconds.)
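
Numbers like these come from torch.autograd.profiler; a minimal sketch of such a measurement, with a stand-in model instead of the actual DGCNN and arbitrary shapes:

import torch
import torch.nn as nn
from torch.autograd import profiler

# Stand-in model just to illustrate the measurement; the real test uses DGCNN.
model = nn.Sequential(
    nn.Conv2d(6, 64, kernel_size=1, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(0.2),
).cuda().eval()
x = torch.rand(8, 6, 1024, 20, device='cuda')  # arbitrary batch

# A few warm-up iterations so one-time initialization is not measured
for _ in range(5):
    model(x)
torch.cuda.synchronize()

with torch.no_grad():
    with profiler.profile(use_cuda=True) as prof:
        model(x)

print(prof.key_averages().table(sort_by='cuda_time_total'))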

This is quite surprising indeed.
Do you have a small code sample (30 lines) that we could use to reproduce this?

Hi, reproducing it in 30 lines of code is quite challenging. I’m trying to isolate the specific part causing this slow forward, but so far I haven’t had any success.
Here you can see a detailed profiling of all the ops executed during the forward pass (https://pastebin.com/RYRFFruE).
The model I’m using is the DGCNN from https://github.com/WangYueFt/dgcnn/blob/master/pytorch/model.py (Wang et al.); in the forward pass a nearest-neighbour graph is constructed at each layer. I’m still trying to understand whether the problem is in the layers’ forward or in some of the ops used for graph construction.
Thank you

Edit:
If I exclude the graph construction part (using a random tensor instead), forward performance seems to be comparable between 1.2 and 1.5. At the same time, profiling the function used for graph construction on the two versions doesn’t show very different behaviour… I’m confused.

Interesting. Do both of them build the same graph? Or could there be some discrepancies here by any chance?

No discrepancies, the graph is constructed in the same way (of course it depends on the input). I also tried substituting my input data with a torch.rand() tensor and then measuring the forward time (results in the pastebin); in this case too, PyTorch 1.2 gives much better performance than 1.5.
Any idea on how to further debug this? Thank you!

EDIT 1:
Investigating a bit further, I’ve produced some tracing plots for PyTorch 1.2 and 1.5… The problem seems to be in the batch normalization layer… am I wrong?

PyTorch 1.2, cudnn enabled: [tracing plot]

PyTorch 1.5, cudnn enabled: [tracing plot]

Edit 2:
Interestingly, setting torch.backends.cudnn.enabled = False improves things with PyTorch 1.5 (roughly half the CUDA time), but it is still slower than PyTorch 1.2 with cudnn enabled.

Do you use the same cudnn version in both cases?
Also make sure to run your code a few times before doing the timing: if cudnn benchmarking is enabled, it will run the forward multiple times to find the best kernel.
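
For example, a minimal timing sketch along these lines (stand-in layer and arbitrary shapes, not your real model) takes care of warm-up and synchronization:

import time
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = False   # if True, extra warm-up runs are needed
# torch.backends.cudnn.enabled = False   # the toggle mentioned in Edit 2 above

layer = nn.Conv2d(6, 64, kernel_size=1, bias=False).cuda()  # stand-in; substitute the real model
x = torch.rand(8, 6, 1024, 20, device='cuda')

for _ in range(10):                      # warm-up, excludes one-time costs
    layer(x)
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    layer(x)
torch.cuda.synchronize()                 # wait for all queued CUDA kernels to finish
print((time.time() - start) / 100 * 1000, 'ms per forward')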

What is the input size for the batchnorm? Does running just the batchnorm with this input size give similar behavior?

PyTorch 1.2 CUDNN 7602, cudnn.enabled=True ==> OK
PyTorch 1.3 CUDNN 7603, cudnn.enabled=True ==> OK
PyTorch 1.5 CUDNN 7603, cudnn.enabled=True ==> NOT OK (slow forward)

I’ve tried those configurations… it seems the problem is not in CUDNN but somewhere in torch 1.5. I apply batch normalization after each layer, so there is more than one BN layer; here is the model:
DGCNN(
  (conv1): Sequential(
    (0): Conv2d(6, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2)
  )
  (conv2): Sequential(
    (0): Conv2d(128, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2)
  )
  (conv3): Sequential(
    (0): Conv2d(128, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2)
  )
  (conv4): Sequential(
    (0): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2)
  )
  (conv5): Sequential(
    (0): Conv1d(512, 1024, kernel_size=(1,), stride=(1,), bias=False)
    (1): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2)
  )
  (cls): Sequential(
    (0): Linear(in_features=2048, out_features=512, bias=False)
    (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=512, out_features=256, bias=True)
    (5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): LeakyReLU(negative_slope=0.2)
    (7): Dropout(p=0.5, inplace=False)
    (8): Linear(in_features=256, out_features=40, bias=True)
  )
)

I’ll keep investigating by profiling the batchnorm layers’ forward. The problem is not cudnn benchmarking: this is not the first forward pass, and I also tried disabling it.
Right now I’m stuck on PyTorch 1.3, while some libraries I need only work with torch > 1.4…

Well, the benchmark you show seems to indicate that the batchnorm function is the problem. If you have cudnn enabled, then PyTorch itself does nothing and just forwards the call to cudnn.

What is the input size for the batchnorm? Does running just the batchnorm with this input size give similar behavior?
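
For example, something like this sketch, run under both versions, would time just the batchnorm (the input shape is a guess based on the conv blocks above):

import torch
import torch.nn as nn
from torch.autograd import profiler

bn = nn.BatchNorm2d(64).cuda().train()
# assumed shape (batch_size, channels, num_points, k), e.g. the output of conv1
x = torch.rand(8, 64, 1024, 20, device='cuda')

for _ in range(10):          # warm-up
    bn(x)
torch.cuda.synchronize()

with profiler.profile(use_cuda=True) as prof:
    for _ in range(100):
        bn(x)
print(prof.key_averages().table(sort_by='cuda_time_total'))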

Could you also post the model code, please?
I cannot infer the connection between conv4 and conv5, as you are switching from 2D layers to 1D ones, and the input channels do not match the output channels between the convX blocks.

You’re right! There are some max ops in between: the max features from the 4 conv layers are concatenated and fed to the Conv1d.
The input is a batch of point clouds, i.e. a tensor of shape [numShapesInBatch, 3, numPointsPerShape].

This is the forward code:

def forward(self, x):
    batch_size = x.size(0)
    x = get_graph_feature(x, k=self.k)  # (batch_size, 3, num_points) => (batch_size, 3*2, num_points, k)
    x = self.conv1(x)  # (batch_size, 3*2, num_points, k) => (batch_size, 64, num_points, k)
    x1 = x.max(dim=-1, keepdim=False)[0]  # (batch_size, 64, num_points, k) => (batch_size, 64, num_points)

    x = get_graph_feature(x1, k=self.k)
    x = self.conv2(x)
    x2 = x.max(dim=-1, keepdim=False)[0]  # => (batch_size, 64, num_points)

    x = get_graph_feature(x2, k=self.k)
    x = self.conv3(x)
    x3 = x.max(dim=-1, keepdim=False)[0]  # => (batch_size, 128, num_points)

    x = get_graph_feature(x3, k=self.k)
    x = self.conv4(x)
    x4 = x.max(dim=-1, keepdim=False)[0]  # => (batch_size, 256, num_points)

    x = torch.cat((x1, x2, x3, x4), dim=1)  # => (batch_size, 64+64+128+256, num_points)

    x = self.conv5(x)  # (batch_size, 64+64+128+256, num_points) => (batch_size, emb_dims, num_points)
    x1 = F.adaptive_max_pool1d(x, 1).view(batch_size, -1)
    x2 = F.adaptive_avg_pool1d(x, 1).view(batch_size, -1)
    x = torch.cat((x1, x2), 1)

    # Classification forward
    x = self.cls(x)
    return x

(P.S. the code is from https://github.com/WangYueFt/dgcnn/blob/20fdb459ca5d10fe8aba1d296e66340f65990b85/pytorch/model.py#L88 )
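
For context, get_graph_feature builds the k-nearest-neighbour edge features used above; a simplified sketch of that kind of construction (not the repository’s exact code) looks roughly like this:

import torch

def knn(x, k):
    # x: (batch, dims, num_points); squared distances via ||a - b||^2 = ||a||^2 - 2*a.b + ||b||^2
    inner = -2 * torch.matmul(x.transpose(2, 1), x)
    xx = torch.sum(x ** 2, dim=1, keepdim=True)
    pairwise = -xx - inner - xx.transpose(2, 1)       # negative squared distances
    return pairwise.topk(k=k, dim=-1)[1]              # indices of the k nearest points

def get_graph_feature_sketch(x, k=20):
    # x: (batch, dims, num_points) -> (batch, 2*dims, num_points, k)
    batch, dims, n = x.size()
    idx = knn(x, k)                                   # (batch, n, k), per-cloud neighbour indices
    idx = idx + torch.arange(batch, device=x.device).view(-1, 1, 1) * n
    flat = x.transpose(2, 1).contiguous().view(batch * n, dims)
    neigh = flat[idx.view(-1)].view(batch, n, k, dims)
    center = x.transpose(2, 1).unsqueeze(2).expand(-1, -1, k, -1)
    edge = torch.cat((neigh - center, center), dim=3)  # edge features (x_j - x_i, x_i)
    return edge.permute(0, 3, 1, 2).contiguous()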