Custom function brings about 2x timing overhead

Forceless · March 6, 2023, 7:41am

My custom function runs before every conv layer and contains only one addition and one division, yet it roughly doubles the inference time.

I want to know how to make it as fast as possible.

Thank you

Example:

from torchvision import models
import torch

def zscore_dr_hook(module: torch.nn.Module, data: tuple) -> None:
    if torch.any((module.weight-0.2)/0.4 > 1000):
        pass

res = models.resnet50(pretrained=True).to('cuda')
rand_input = torch.randn(8, 3, 224, 224, device='cuda:0')
for i in range(100):
    res(rand_input)

keys = ['conv1.weight',
        'layer1.0.conv1.weight',
        'layer1.0.conv2.weight',
        'layer1.0.conv3.weight',
        'layer1.0.downsample.0.weight',
        'layer1.1.conv1.weight',
        'layer1.1.conv2.weight',
        'layer1.1.conv3.weight',
        'layer1.2.conv1.weight',
        'layer1.2.conv2.weight',
        'layer1.2.conv3.weight',
        'layer2.0.conv1.weight',
        'layer2.0.conv2.weight',
        'layer2.0.conv3.weight',
        'layer2.0.downsample.0.weight',
        'layer2.1.conv1.weight',
        'layer2.1.conv2.weight',
        'layer2.1.conv3.weight',
        'layer2.2.conv1.weight',
        'layer2.2.conv2.weight',
        'layer2.2.conv3.weight',
        'layer2.3.conv1.weight',
        'layer2.3.conv2.weight',
        'layer2.3.conv3.weight',
        'layer3.0.conv1.weight',
        'layer3.0.conv2.weight',
        'layer3.0.conv3.weight',
        'layer3.0.downsample.0.weight',
        'layer3.1.conv1.weight',
        'layer3.1.conv2.weight',
        'layer3.1.conv3.weight',
        'layer3.2.conv1.weight',
        'layer3.2.conv2.weight',
        'layer3.2.conv3.weight',
        'layer3.3.conv1.weight',
        'layer3.3.conv2.weight',
        'layer3.3.conv3.weight',
        'layer3.4.conv1.weight',
        'layer3.4.conv2.weight',
        'layer3.4.conv3.weight',
        'layer3.5.conv1.weight',
        'layer3.5.conv2.weight',
        'layer3.5.conv3.weight',
        'layer4.0.conv1.weight',
        'layer4.0.conv2.weight',
        'layer4.0.conv3.weight',
        'layer4.0.downsample.0.weight',
        'layer4.1.conv1.weight',
        'layer4.1.conv2.weight',
        'layer4.1.conv3.weight',
        'layer4.2.conv1.weight',
        'layer4.2.conv2.weight',
        'layer4.2.conv3.weight',
        'fc.weight']
handles = []

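def register_hook(model, hook) -> None:
    for key in keys:
        # Strip the trailing ".weight" to get the submodule name.
        module_name = key.rsplit(".", 1)[0]
        module = model.get_submodule(module_name)
        handles.append(module.register_forward_pre_hook(hook=hook))


register_hook(res, zscore_dr_hook)


def remove_hook(model) -> None:
    # `model` is unused; the hooks are detached through the stored handles.
    for handle in handles:
        handle.remove()
    handles.clear()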
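
A minimal sketch of an alternative to the hard-coded `keys` list, assuming the goal is simply "every conv layer plus the final linear layer": walk `named_modules()` and filter by type. The names `noop_pre_hook`, `register_pre_hooks`, `remove_hooks`, and `tiny_model` are hypothetical and not from the original post; the tiny model only stands in for ResNet-50 so the snippet runs without torchvision.

```python
import torch
from torch import nn


def noop_pre_hook(module: nn.Module, inputs: tuple) -> None:
    """Placeholder hook; returning None leaves the inputs unchanged."""
    return None


def register_pre_hooks(model: nn.Module, hook) -> list:
    """Attach `hook` to every Conv2d/Linear submodule and return the handles."""
    handles = []
    for _name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            handles.append(module.register_forward_pre_hook(hook))
    return handles


def remove_hooks(handles: list) -> None:
    """Detach all hooks registered above and empty the list."""
    for handle in handles:
        handle.remove()
    handles.clear()


# Tiny stand-in model (hypothetical) so the sketch runs without torchvision.
tiny_model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(4, 4, kernel_size=3, padding=1),
    nn.Flatten(),
    nn.Linear(4 * 8 * 8, 2),
)
hook_handles = register_pre_hooks(tiny_model, noop_pre_hook)
num_hooked = len(hook_handles)
with torch.no_grad():
    out = tiny_model(torch.randn(1, 3, 8, 8))
remove_hooks(hook_handles)
```

Returning the handles instead of appending to a module-level list keeps registration and removal paired, which makes it easier to time the model with and without hooks.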

Forceless · March 6, 2023, 7:54am

One possible solution I found is to port this function to libtorch. Any advice?

Forceless · March 6, 2023, 12:11pm

@ptrblck Sorry to bother you.
I see that you have helped many people and know PyTorch very well. Could you please give me some advice?

TzviNoy · March 6, 2023, 12:35pm

I assume the reason for the slowdown is the use of hooks. Usually, during training, intermediate feature maps are saved for the backward pass, while at inference time there is no need for them, so less time is spent moving those tensors in and out of memory.
But when you use hooks, you ask PyTorch to save those maps during inference, so you lose the speed that regular inference brings.
To recap, I think the overhead comes not from your simple math operations but from the use of hooks.
I have no brilliant solution other than changing the model itself, which it seems you tried to avoid. I hope someone has a better idea.

Forceless · March 6, 2023, 12:41pm

Hi, I tried implementing this without a hook, like this:

from types import MethodType

# float_to_bin / bin_to_float are helper functions defined elsewhere (not shown in this post).
def odr_check(self: torch.nn.Conv2d, x: torch.Tensor):
    if (self.weight > 2).any():
        mean = torch.median(self.weight)
        std = torch.median(torch.abs(self.weight - mean))
        #std, mean = torch.std_mean(self.weight)
        # split_val = torch.quantile(x, 0.9)
        # x = torch.where(x < split_val, x, y)
        zscore = torch.abs((self.weight - mean) / std)
        outliers = [tuple(i) for i in torch.nonzero(zscore > 1000)]
        for idx in outliers:
            outlier = self.weight[idx]
            bits = float_to_bin(outlier)
            bits = bits[0]+'0'+bits[2:]
            # Note: writing into self.weight in place requires running under torch.no_grad().
            self.weight[idx] = bin_to_float(bits)
    return self._conv_forward(x, self.weight, self.bias)

# `conv` is one Conv2d module of the model; this replaces its forward method.
conv.forward = MethodType(odr_check, conv)

I also wrote code to replace the forward function of every conv layer, but the time overhead did not decrease at all. Maybe the hook was not the bottleneck, or I replaced the forward function the wrong way.

ptrblck · March 6, 2023, 8:09pm

To get a good understanding of what exactly is slowing down your code, you could profile it with e.g. Nsight Systems as described here.
Based on your code you are using data-dependent control flow by checking the weights in:

if torch.any((module.weight-0.2)/0.4 > 1000):

which will add synchronizations to your code and would block the CPU.
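
As a lighter-weight complement to Nsight Systems (which is what the post recommends), the built-in `torch.profiler` can give a first impression of where time goes. This is only a sketch under assumptions: the single conv layer and input sizes are hypothetical stand-ins for the real model, and the code falls back to CPU when no GPU is present. On a CUDA run, the blocking typically shows up in entries related to reading a value back to the host, such as `aten::item` or a stream synchronization call.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile

device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical stand-in for one layer of the real model.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device).eval()
x = torch.randn(8, 3, 64, 64, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad(), profile(activities=activities) as prof:
    for _ in range(5):
        y = model(x)
        # Data-dependent branch: `if` calls bool() on a tensor,
        # so the value has to be available on the host.
        if torch.any((model.weight - 0.2) / 0.4 > 1000):
            pass

# Per-operator summary, sorted by total CPU time.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=15)
print(report)
```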

Forceless · March 8, 2023, 8:50am

Do you have any suggestions for fixing this?
I am confused by the profiler you mentioned.
You also said that this operation adds a synchronization, but wouldn't a conv operation do so as well?
Thanks

ptrblck · March 8, 2023, 9:05am

If you need to use these if-conditions there won’t be a way to remove the synchronizations.
No, conv operations or other kernels won’t add synchronizations with the host unless you need to read actual values.
Your Python script is executed by the CPU which is responsible for the dispatching, the kernel launches, and the general program execution.
If you just execute the forward pass without printing any values or without any conditions which depend on output values from the model, no synchronizations will be added and the CPU can “run ahead” with the kernel scheduling. Assuming the actual GPU workload is not tiny, this would allow the CPU to schedule the kernels fast enough so that the GPU is busy and the kernels are “packed”, i.e. they are executed without much delay.
However, if the GPU workload is tiny in comparison to the time it takes the CPU to launch the next kernel or if you force the CPU to synchronize, the GPU might have to wait before the next CUDA kernel can be launched.
Since your if-condition depends on values stored on the GPU, these values must be transferred to the host so that the CPU can use this value to continue with the Python script execution.
The performance guide goes into more detail about this.
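
The point about asynchronous execution also affects how timings should be taken. The following is a minimal sketch, assuming a hypothetical helper `timed` and a random stand-in `weight` tensor: it synchronizes before reading the clock so that queued GPU work is included, and contrasts a check that branches on a tensor value with one that leaves the result on the device. In isolation the two timings may be close, because the cost of a forced synchronization mostly appears inside a full model, where it stops the CPU from running ahead.

```python
import time

import torch


def timed(fn, iters: int = 50, warmup: int = 5) -> float:
    """Average wall-clock seconds per call, synchronizing so queued GPU work is counted."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
weight = torch.randn(64, 64, 3, 3, device=device)  # stand-in for a conv weight


def check_with_branch() -> None:
    # `if` calls bool() on a tensor: the CPU must wait for the GPU result.
    if torch.any((weight - 0.2) / 0.4 > 1000):
        pass


def check_without_branch() -> torch.Tensor:
    # The result stays on the device; nothing is read back to the host here.
    return torch.any((weight - 0.2) / 0.4 > 1000)


t_branch = timed(check_with_branch)
t_no_branch = timed(check_without_branch)
print(f"with branch: {t_branch * 1e6:.1f} us, without: {t_no_branch * 1e6:.1f} us")
```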

Forceless · March 8, 2023, 11:37am

Sorry to bother you again; you really did help me. After thinking about what I actually need, I believe the if-condition can be removed and the function becomes a single line:

data *= ~(torch.abs((data - 0.2) / 0.4) > 1000)

but it still adds a lot of time (about 30%). I carefully read this section and did not find any obvious blocking operation.
Or could this bitwise operation and multiplication act as implicit control flow?

Is there any way to make it faster, e.g. to get the overhead below 10%?
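
For reference, the one-liner can also be written as a masked fill or with `torch.where`; all three forms are branch-free and stay on the device. This is a sketch with made-up sample values, not a claim that either alternative is faster, which would need profiling. One behavioral difference worth knowing: multiplying by a zero mask turns an infinite outlier into NaN (inf * 0), while `masked_fill_` and `torch.where` write an exact 0.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Made-up sample values: two entries are far outside the threshold.
data = torch.tensor([0.1, -0.3, 5000.0, 0.25, -9000.0], device=device)

# Original one-liner from the post, applied to a copy.
a = data.clone()
a *= ~(torch.abs((a - 0.2) / 0.4) > 1000)

# Same effect with an in-place masked fill: outliers are set to 0 directly.
b = data.clone()
b.masked_fill_(torch.abs((b - 0.2) / 0.4) > 1000, 0.0)

# Out-of-place variant, useful if the original tensor must stay untouched.
c = torch.where(torch.abs((data - 0.2) / 0.4) > 1000, torch.zeros_like(data), data)
```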

Forceless · March 8, 2023, 11:49am

Another thing I found is that the hook function itself seems to add a synchronization with the host: after I moved the logic into a modified forward function and removed the control flow, the overhead dropped by about 50%.

Forceless · March 8, 2023, 11:57am

I want the highest possible efficiency; it is important to me.
To get there, I am considering using libtorch, compiling the function, or converting the model to TorchScript. Do you have any advice?

Forceless · March 8, 2023, 12:35pm

I tried jit.trace like this:

jit_check_model = torch.jit.trace(res.forward, torch.randn(1, 3, 224, 224, device='cuda:0'))

It seems to work well: the overhead was halved and is now about 10%.
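
A small self-contained sketch of the same idea, assuming a hypothetical `MaskedConv` module that applies the branch-free outlier mask inside `forward` before convolving. It traces the module with one example input and compares traced and eager outputs on a different batch size. Note that tracing only records the operations executed for the example input, so a data-dependent Python `if` would be frozen to whichever branch was taken during tracing; the branch-free form avoids that problem.

```python
import torch
from torch import nn


class MaskedConv(nn.Module):
    """Hypothetical wrapper: zeroes outlier weights branch-free, then convolves."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.conv.weight
        mask = torch.abs((w - 0.2) / 0.4) > 1000
        w = torch.where(mask, torch.zeros_like(w), w)
        return nn.functional.conv2d(x, w, self.conv.bias, padding=1)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = MaskedConv().to(device).eval()
example = torch.randn(1, 3, 32, 32, device=device)

with torch.no_grad():
    traced = torch.jit.trace(model, example)
    # Run both versions on a batch size different from the tracing example.
    batch = torch.randn(4, 3, 32, 32, device=device)
    eager_out = model(batch)
    traced_out = traced(batch)

max_diff = (eager_out - traced_out).abs().max().item()
print(f"max abs difference between eager and traced: {max_diff:.2e}")
```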

ptrblck · March 8, 2023, 8:49pm

Yes, scripting the model or using torch.compile might speed up your code.
To further debug your script I would recommend profiling the actual run to see where a bottleneck might still be as explained in the link in my first post here.