About nvFuser

After reading "Accelerate Your Scripts with nvFuser", we began testing the speedup nvFuser provides. On some networks, such as ResNet50 and DenseNet, TorchScript achieves a ~2.x speedup over eager mode, but when we tested BasicVSR we didn't see any speedup. Why?

Our script:

import os
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmedit.models.registry import BACKBONES
from torchvision import models


@torch.no_grad()
def profile_workload_nn(forward_func, inputs, iteration_count=100, label=""):
    # Perform warm-up iterations so that any JIT compilation happens before timing
    for _ in range(20):
        # Run the forward pass only (gradients are disabled)
        output = forward_func(inputs)

    # Synchronize the GPU before starting the timer
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iteration_count):
        # Run the forward pass only (gradients are disabled)
        output = forward_func(inputs)

    # Synchronize the GPU before stopping the timer
    torch.cuda.synchronize()
    stop = time.perf_counter()
    iters_per_second = iteration_count / (stop - start)
    if label:
        print(label)
    print("Average iterations per second: {:.2f}".format(iters_per_second))


def test_nvfuser():
    eager_module = BACKBONES.get('BasicVSRNet')()
    # eager_module = models.resnet50(pretrained=False)
    # eager_module = models.densenet121(pretrained=False)
    eager_module.half().cuda()
    input1 = torch.ones(1, 5, 3, 512, 512).half().cuda()
    eager_module.eval()
    with torch.no_grad():
        eager_result = eager_module(input1)

        profile_workload_nn(
            eager_module, input1, iteration_count=100, label="Eager")
        trace_module = torch.jit.trace(eager_module, input1)
        profile_workload_nn(
            trace_module, input1, iteration_count=100, label="TorchScript")
        trace_result = trace_module(input1)
        # Mean absolute difference between eager and traced outputs
        error = torch.abs(eager_result - trace_result).mean()
        print(error)

if __name__ == "__main__":
    test_nvfuser()
 

The observed speedup depends on the model architecture, and in particular on which operations it uses. In the last stable release (PyTorch 1.12.0), nvFuser targeted pointwise, reduction, and normalization operations. To see the latest developments, install the latest nightly binary and rerun your scripts.
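
To make the dependence on the operation mix concrete, here is a rough back-of-the-envelope sketch based on Amdahl's law. It assumes that only a fraction of the eager runtime is spent in operations nvFuser can fuse, and that fusion speeds up that fraction by some factor. The function name `overall_speedup` and all numbers below are hypothetical and chosen only for illustration; they are not measurements of ResNet50, DenseNet, or BasicVSR.

```python
def overall_speedup(fusible_fraction, fused_speedup):
    """Amdahl's-law estimate of the end-to-end speedup.

    fusible_fraction: share of eager runtime spent in ops the fuser handles (0..1).
    fused_speedup:    speedup of that share once it is fused (> 0).
    """
    if not 0.0 <= fusible_fraction <= 1.0:
        raise ValueError("fusible_fraction must be between 0 and 1")
    if fused_speedup <= 0:
        raise ValueError("fused_speedup must be positive")
    # The unfused part runs at its original speed; only the fused part shrinks.
    return 1.0 / ((1.0 - fusible_fraction) + fusible_fraction / fused_speedup)


# Hypothetical operation mixes, both assuming a 3x gain on the fused part.
for label, fraction in [("mostly fusible ops", 0.8), ("mostly non-fusible ops", 0.1)]:
    print("{}: {:.2f}x".format(label, overall_speedup(fraction, 3.0)))
# mostly fusible ops: 2.14x
# mostly non-fusible ops: 1.07x
```

Under these assumptions, a model that spends most of its time in operations outside the fused set can show almost no end-to-end gain even when fused kernels are generated and run faster individually.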

Thanks for your reply. Our PyTorch version is 1.12.1+cu116, and the GPU is an RTX 3090 Ti. We also printed the profiling information following the nvFuser tutorial: the fused kernels are indeed generated, but they do not make the model any faster.