ResNet execution speed on a GTX 1080

I’m running PyTorch 0.4.1 on Python 3.5. A paper I’m trying to reproduce reports a 13ms execution time for a ResNet50-based model on a GTX 1080 Ti using Caffe.

I’ve been able to translate the exact network to PyTorch using MMdnn. However, on my GTX 1080 the ResNet part alone takes between 29 and 39ms, and the entire network takes between 35 and 54ms (I’m surprised it varies so much between subsequent executions; is that normal in PyTorch?). I’ve also tried torchvision’s resnet50 model for comparison, but its execution time is even worse: 53ms for the stub it has in common with my network and 58ms for the full ResNet50.

I understand that a GTX 1080 Ti is better than a 1080, but still the difference is too large. Unfortunately, I haven’t been able to run the network on Caffe on my machine for comparison, as Caffe is hell to compile (I need a custom layer).

Here is my code for reproduction:

import numpy as np
import torch
from timeit import default_timer as timer
from torchvision.models import resnet50

def main():
    # Define model and input data
    resnet = resnet50().cuda()
    x = torch.from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32)).cuda()  # Entire network
    # x = torch.from_numpy(np.random.rand(1, 64, 32, 32).astype(np.float32)).cuda()   # Stub alone

    # The first pass is always slower, so run it once
    resnet(x)

    # Measure elapsed time
    passes = 20
    total_time = 0
    for _ in range(passes):
        start = timer()
        resnet(x)
        delta = timer() - start
        print('Forward pass: %.3fs' % delta)
        total_time += delta
    print('Average forward pass: %.3fs' % (total_time / passes))

if __name__ == '__main__':
    main()
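
A note on the timing itself: CUDA kernels are launched asynchronously, so a host-side timer around a forward pass can measure how long it took to queue the work rather than how long the GPU took to run it. That may account for some of the run-to-run variation. Below is a minimal sketch of a timing loop that synchronizes before starting and before stopping the clock. The helper names `benchmark`, `summarize` and `time_resnet50` are made up for illustration and are not part of PyTorch. The use of `eval()` and `torch.no_grad()` assumes only inference speed matters.

```python
import statistics
from timeit import default_timer as timer


def benchmark(fn, sync=None, warmup=3, passes=20):
    """Return a list of per-pass wall-clock times (seconds) for fn().

    sync, if given, is called right before starting and right before
    stopping the clock, so asynchronously queued work is fully counted.
    """
    # Warm-up passes are run but not timed
    for _ in range(warmup):
        fn()

    times = []
    for _ in range(passes):
        if sync is not None:
            sync()  # make sure no earlier work is still pending
        start = timer()
        fn()
        if sync is not None:
            sync()  # wait until the work launched by fn() has finished
        times.append(timer() - start)
    return times


def summarize(times):
    """Summary statistics, to see the spread as well as the average."""
    return {
        'mean': statistics.mean(times),
        'median': statistics.median(times),
        'min': min(times),
        'max': max(times),
    }


def time_resnet50():
    """Hypothetical usage with torchvision's resnet50 on a CUDA device."""
    import torch
    from torchvision.models import resnet50

    resnet = resnet50().cuda().eval()  # inference mode for batchnorm
    x = torch.rand(1, 3, 224, 224).cuda()
    with torch.no_grad():  # skip autograd bookkeeping
        times = benchmark(lambda: resnet(x), sync=torch.cuda.synchronize)
    return summarize(times)
```

Calling `time_resnet50()` returns the mean, median, min and max in seconds. Comparing the min and the median against the max shows whether a few slow outliers are skewing the average.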

When I refer to the stub, I mean that I commented out the following lines in torchvision/models/

    def forward(self, x):
        # x = self.conv1(x)
        # x = self.bn1(x)
        # x = self.relu(x)
        # x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        # x = self.avgpool(x)
        # x = x.view(x.size(0), -1)
        # x = self.fc(x)

        return x

Well, my problem just sort of got solved… I managed to temporarily get my hands on a GTX 1080 Ti, and the difference in performance really is that big: 10ms on average for the full ResNet50. I can’t say I expected that.

edit: Actually it seems to be a matter of configuration… if I run on Linux on my own machine, I get 12ms, which is much better (vs 58ms on Windows). Guess I gotta keep exploring…