What is a normal inference speed for ResNet34, and how can I improve it?

Hi guys,

I’m new to deep learning.

I have a classification task and am doing transfer learning with ResNet34 (the PyTorch implementation). The dataset I use is the Stanford Cars Dataset. For training, I started from the pretrained weights and fine-tuned the model on the vehicle dataset. The accuracy I got on the training and validation sets is 99% and 89%, respectively (by the way, is this overfitting? I’m not sure).

When I run inference, I feed one 224x224 image to my fine-tuned model and get the correct label. But the inference is quite slow: around 160 ms. I think it should be much faster than this.

So what is a normal inference speed for ResNet34?

And how can I increase the inference speed? (I know half precision may help, but I’m not really sure.)

Can anyone help me?

Thanks!

The inference speed depends on a lot of factors, e.g.:

  • are you loading each image, pushing to the device, and running a single forward pass?
    If so, are you adding the timing of all steps or just the actual forward pass?
  • Half precision might help if you are using a GPU with Volta/Tensor Cores (see the sketch after this list)
  • Yes, it looks like your model is overfitting on the training data, as the accuracy gap is quite large
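
To give a rough idea of the half-precision approach, here is a minimal sketch (assuming a torchvision ResNet34 on a CUDA device; the random tensor just stands in for a preprocessed image):

import torch
import torchvision.models as models

device = torch.device("cuda")

# cast both the model weights and the input to float16
model = models.resnet34().to(device).half().eval()
x = torch.randn(1, 3, 224, 224, device=device).half()

with torch.no_grad():
    out = model(x)  # FP16 forward pass; benefits most from Volta+ Tensor Cores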

Thanks for your reply!

  1. This is part of my code. I load only one image and only time the forward pass.
import time

import cv2
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

# model_path (path to the fine-tuned checkpoint) is defined elsewhere in my script

def test_image(test_path):
    device = torch.device("cuda")
    # test_path: "./CarDataset/data/test/00001.jpg"
    input = cv2.imread(test_path)
    original = input

    # OpenCV loads BGR; convert to RGB and wrap in a PIL image for the transforms
    input = cv2.cvtColor(input, cv2.COLOR_BGR2RGB)
    input = Image.fromarray(input)

    mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
    normalize = transforms.Normalize(mean=mean, std=std)

    transform_test = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        normalize
    ])

    input = transform_test(input)
    input = torch.unsqueeze(input, 0)  # add the batch dimension
    input = input.to(device)

    model = models.resnet34()
    num_ftrs = model.fc.in_features
    model.fc = nn.Linear(num_ftrs, 196)  # 196 classes in Stanford Cars
    model.load_state_dict(torch.load(model_path)["state_dict"])

    model.to(device)
    model.eval()

    # timing only the forward pass
    start = time.time()
    output = model(input)
    end = time.time()
    print("processing time: ", (end - start) * 1000, "ms")

    _, pred = torch.max(output, 1)

    text = str(pred.data.item())
    input = input.squeeze(0).permute(1, 2, 0).cpu().numpy()

    cv2.putText(original, text, (10, 50), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
    cv2.imshow(test_path, original)

    cv2.waitKey(0)
  2. Could you give me some suggestions on how to avoid overfitting? This is the optimizer I used during training.
    criterion = nn.CrossEntropyLoss().cuda()
    optimizer = torch.optim.SGD(model.parameters(), 0.01,
                                momentum=0.9,
                                nesterov=False,
                                weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', patience=10)

Thanks again!

Thanks for the code!
Since CUDA operations are asynchronous, you should synchronize the code before starting and stopping the timer via torch.cuda.synchronize().

Also, generally you should add some warmup iterations to get stable results.
Especially if you are using torch.backends.cudnn.benchmark = True, as the first iterations will time different kernel implementations and choose the fastest ones for your workload.
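
A minimal timing sketch along these lines, reusing the model and input from your snippet (the warmup count is arbitrary):

# a few warmup iterations so cudnn benchmarking and lazy CUDA init don't skew the result
with torch.no_grad():
    for _ in range(10):
        _ = model(input)

torch.cuda.synchronize()  # wait for pending GPU work before starting the timer
start = time.time()
with torch.no_grad():
    output = model(input)
torch.cuda.synchronize()  # wait for the forward pass to finish before stopping the timer
end = time.time()
print("processing time: ", (end - start) * 1000, "ms")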

Adding regularization like Dropout, more aggressive data augmentation, early stopping, etc. could help counter overfitting.
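
As an example of more aggressive data augmentation, the training transforms could look something like this (the exact transforms and magnitudes are just illustrative):

from torchvision import transforms

mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

# extra augmentation for the training set only; keep the plain
# resize + normalize pipeline for validation and test
transform_train = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=std),
])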

You’re correct! I do need a warmup for my GPU. The processing time per image has decreased to 3-4 ms.

For the overfitting, I have tried dropout (drop rate 0.5) but it didn’t work. Here is my code for building the final classifier:

#input size is 224x224
model = models.resnet152(pretrained=True)
num_ftrs = model.fc.in_features
model.fc = build_classifier(num_ftrs, [1050,500], 196)

def build_classifier(num_in_features, hidden_layers, num_out_features):
    classifier = nn.Sequential()
    # when we don't have any hidden layers
    if hidden_layers is None:
        classifier.add_module('fc0', nn.Linear(num_in_features, num_out_features))
    # when we have hidden layers
    else:      
        layer_sizes = zip(hidden_layers[:-1], hidden_layers[1:])
        classifier.add_module('fc0', nn.Linear(num_in_features, hidden_layers[0]))
        classifier.add_module('relu0', nn.ReLU())
        classifier.add_module('drop0', nn.Dropout(.5))
        
        for i, (h1, h2) in enumerate(layer_sizes):
            classifier.add_module('fc'+str(i+1), nn.Linear(h1, h2))
            classifier.add_module('relu'+str(i+1), nn.ReLU())
            classifier.add_module('drop'+str(i+1), nn.Dropout(.5))
        
        classifier.add_module('output', nn.Linear(hidden_layers[-1], num_out_features))
        
    return classifier

Did I do something wrong?

I think synchronizing only after the code block is fine.
One thing I found is that for large torch models, such as ResNet50, the latency measured with synchronization does not differ much from the latency measured without it. According to PyTorch semantics, PyTorch does implicit synchronization when there is a memory copy. However, I don’t actually see any memory copies in those models, so I suspect that some PyTorch ops used in the models perform synchronization. Can you please provide some explanations or details? Thank you.

If you are throwing out the first iteration, then synchronizing before stopping the timer might work.
In any case, torch.utils.benchmark provides a safe approach to profiling workloads, which I would generally recommend in order to avoid these issues coming from manual profiling.
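
A small sketch of how that could look (assuming model and an input tensor x are already on the GPU; the number of runs is arbitrary):

import torch.utils.benchmark as benchmark

# Timer takes care of CUDA synchronization and performs warmup runs before timing
timer = benchmark.Timer(
    stmt="model(x)",
    globals={"model": model, "x": x},
)
print(timer.timeit(100))  # runtime statistics over 100 runs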

I don’t think the size of the model matters, but rather the general ability of the CPU to run ahead.
I.e. if the model contains a lot of (small) layers, data-dependent control flow, etc., your profiling might not see a large difference between a synchronized and an async run. However, I usually don’t dig into profiles that were created without syncs, as the majority of them are wrong.


Thanks. I was actually assuming there was no GPU activity, or that it had already been synchronized before the first iteration. Having the synchronization before the first iteration makes sense based on your description.
I will also check the benchmark tool.