Forward processing speed slows down increasingly during iterative runs [libtorch, torchscript]

Processing speed gets increasingly slow, perhaps due to a GPU memory problem.
Apart from the first run, the early iterations are pretty fast.
However, the processing time suddenly becomes slow after the 13th run.
Every 30 iterations the program sleeps for 3 seconds, which shows that the processing time recovers for about 12 iterations afterwards.

How can I resolve this problem?

Environment:
GPU : GTX 1060 6 GB
Image size : 800x800x1

Code:

// Load the TorchScript module directly onto the GPU and run without autograd.
torch::jit::script::Module module = torch::jit::load(NetworkPath, torch::kCUDA);
torch::NoGradGuard no_grad;
torch::Tensor outputArr[100];
clock_t t;
std::cout << "Start Iteration run" << std::endl;
for (int i = 0; i < 100; i++) {
    // Every 30 iterations, sleep for 3 s (Windows Sleep) to check whether the speed recovers.
    if (i != 0 && i % 30 == 0) {
        std::cout << "------ every 30 iterations forced Sleep 3 sec -----" << std::endl;
        Sleep(3000);
    }
    t = clock();
    outputArr[i] = module.forward({ aa_cu }).toTensor();
    std::cout << "[Time] Iter : " << i + 1 << " // Time(ms) : " << clock() - t << std::endl;
}

CUDA operations are asynchronous, so you would have to synchronize the code before starting and before stopping the timer.
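
For example, a minimal sketch (reusing module and aa_cu from your snippet; cudaDeviceSynchronize comes from cuda_runtime_api.h):

// Synchronize on both sides of the timed region so the queued GPU work
// is actually included in the measurement.
cudaDeviceSynchronize();                 // drain previously queued work
clock_t t = clock();
auto out = module.forward({ aa_cu }).toTensor();
cudaDeviceSynchronize();                 // wait until the forward pass has finished
std::cout << "Time(ms) : " << (clock() - t) * 1000.0 / CLOCKS_PER_SEC << std::endl;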


Hm… I thought that only the CUDA operations inside the forward function were asynchronous.
I timed the whole loop instead, and the total processing time shows the same problem.
Is there some way to synchronize the forward function?

Result (whole process time): [screenshot]

Code

torch::Tensor outputArr[5];
std::cout << "Start Iteration run" << std::endl;
clock_t t = clock();
for (int i = 0; i < 100; i++) {
    // Keep only the last 5 outputs; the whole 100-iteration loop is timed at once.
    outputArr[i % 5] = module.forward({ aa_cu }).toTensor();
}
std::cout << "[Time] 100 Iter : " << clock() - t << std::endl;

This should work:

#include <ATen/cuda/CUDAContext.h>  // at::cuda::getCurrentCUDAStream
#include <ATen/cuda/Exceptions.h>   // AT_CUDA_CHECK

at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
AT_CUDA_CHECK(cudaStreamSynchronize(stream));
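
Used inside a timing loop, it would look roughly like this (a sketch; same module and aa_cu as above, plus the two headers from the snippet — the header locations may differ slightly between libtorch versions):

at::cuda::CUDAStream stream = at::cuda::getCurrentCUDAStream();
AT_CUDA_CHECK(cudaStreamSynchronize(stream));   // drain pending work on this stream
clock_t t = clock();
auto out = module.forward({ aa_cu }).toTensor();
AT_CUDA_CHECK(cudaStreamSynchronize(stream));   // wait for the forward pass
std::cout << "Time(ms) : " << (clock() - t) * 1000.0 / CLOCKS_PER_SEC << std::endl;

// CUDAStream converts implicitly to cudaStream_t, so it can be passed
// straight to the CUDA runtime; no extra input data is needed.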

As you suggested, I used the cudaStreamSynchronize function, but nothing changed.
I am not even sure whether I used the function correctly :frowning:
I think I did not get a CUDA stream from the getCurrentCUDAStream function,
because the stream variable looks empty, as in the 2nd screenshot.

Should I pass a stream as input data for cudaStreamSynchronize to work?

Result: [screenshot]

Code


@tom @ptrblck @albanD I am sorry to mention you guys. Please help :sob: It happens to me as well. I synchronized the GPU and the problem still persists.

I use: GTX 1050M, CUDA 10.1, libtorch 1.6

@AppleTree did you resolve the problem in the end?

Could you post a minimal, executable code snippet that shows the slowdown with proper synchronization, please?

Thank you very much for your response. Here is the minimal working code:


#include <iostream>
#include <torch/torch.h>
#include <torch/script.h>
#include <cuda_runtime_api.h>

int main() {
    torch::jit::script::Module script_module = torch::jit::load("detector.pt");
    std::vector<torch::jit::IValue> inputs;
    auto img_tensor = torch::zeros({1, 3, 640, 640}).cuda();
    inputs.emplace_back(img_tensor);
    for (int i = 0; i < 10000; i++) {
        clock_t begin = clock();  // note: timer starts before the first synchronize
        torch::NoGradGuard no_grad_guard;
        cudaDeviceSynchronize();  // drain previously queued GPU work
        auto output = script_module.forward(inputs);
        cudaDeviceSynchronize();  // wait for this forward pass to finish
        clock_t end = clock();
        // clock() returns CPU ticks; dividing by 1000 yields milliseconds only
        // if CLOCKS_PER_SEC == 1'000'000 (true on Linux/glibc).
        double elapsed_time = double(end - begin) / 1000;
        if (i % 100 == 0) {
            std::cout << "Iteration : " << i << " Elapsed time " << elapsed_time << std::endl;
        }
    }
    return 0;
}

Here is the output

Iteration : 0 Elapsed time 586.484
Iteration : 100 Elapsed time 29.035
Iteration : 200 Elapsed time 28.963
Iteration : 300 Elapsed time 29.021
Iteration : 400 Elapsed time 29.029
Iteration : 500 Elapsed time 29.407
Iteration : 600 Elapsed time 112.363
Iteration : 700 Elapsed time 141.069
Iteration : 800 Elapsed time 34.88
Iteration : 900 Elapsed time 104.44
Iteration : 1000 Elapsed time 116.31
Iteration : 1100 Elapsed time 100.833

Environment: NVIDIA-SMI 440.100, Driver Version: 440.100, CUDA Version: 10.1, Ubuntu 18.04, GTX 1050M, libtorch 1.6

Hi,

The runtime actually seems to be fluctuating quite a lot.
Are there other processes that use the CPU or GPU on the same machine?


Thanks for your response @albanD. I am sure there is no background process or other heavy-lifting process using the CPU or GPU on the same machine. I can also provide the traced model if you need it; here is the link: https://drive.google.com/file/d/1jPVGtMcbqvl3r9oGuqBlbgyKsDLY5RSv/view?usp=sharing

I also tested the executable for memory leaks using valgrind, and there is no leak.

Any update on this, please?

I tested your code and get stable results, though I would recommend accumulating the time over a few hundred iterations and calculating the mean once.
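Something along these lines, as a rough sketch of the averaging approach (reusing script_module and inputs from your snippet):

// Warm up first, then time a block of iterations once and report the mean,
// which smooths out per-iteration clock noise.
torch::NoGradGuard no_grad_guard;
const int warmup = 10;
const int iters = 300;
for (int i = 0; i < warmup; i++)
    script_module.forward(inputs);
cudaDeviceSynchronize();
clock_t begin = clock();
for (int i = 0; i < iters; i++)
    script_module.forward(inputs);
cudaDeviceSynchronize();
clock_t end = clock();
std::cout << "mean ms/iter : " << double(end - begin) / CLOCKS_PER_SEC * 1000.0 / iters << std::endl;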
Anyway, for resnet101, this is the result for libtorch 1.6.0:

Iteration : 0 Elapsed time 1837.04
Iteration : 100 Elapsed time 22.734
Iteration : 200 Elapsed time 22.597
Iteration : 300 Elapsed time 22.264
Iteration : 400 Elapsed time 22.724
Iteration : 500 Elapsed time 22.885
Iteration : 600 Elapsed time 22.647
Iteration : 700 Elapsed time 22.73
Iteration : 800 Elapsed time 22.692
Iteration : 900 Elapsed time 22.541
Iteration : 1000 Elapsed time 22.549
Iteration : 1100 Elapsed time 22.788
Iteration : 1200 Elapsed time 22.694
Iteration : 1300 Elapsed time 22.637
Iteration : 1400 Elapsed time 22.744
Iteration : 1500 Elapsed time 22.663
Iteration : 1600 Elapsed time 22.66
Iteration : 1700 Elapsed time 22.785
Iteration : 1800 Elapsed time 22.681
Iteration : 1900 Elapsed time 22.786
Iteration : 2000 Elapsed time 22.668
Iteration : 2100 Elapsed time 22.697
Iteration : 2200 Elapsed time 22.749
...

Note that the first iteration takes longer, since you are starting the timer before calling the first synchronize op.
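
I.e., roughly (same variables as in your snippet):

// Synchronize BEFORE reading the start time, so leftover queued work is not
// billed to the current iteration.
cudaDeviceSynchronize();
clock_t begin = clock();
auto output = script_module.forward(inputs);
cudaDeviceSynchronize();
clock_t end = clock();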

@ptrblck Thank you kindly for your response. What are the possible reasons behind the fluctuation, besides other processes running in the background?
Is it possible that the deep learning model causes such fluctuation?

Could be the case, e.g. if the model uses different code paths.
If you can post the model definition, I could rerun the script and check the timings on my system.

Here is the model:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math

import torch
import torch.nn as nn
import torch.utils.model_zoo as model_zoo

BN_MOMENTUM = 0.01
model_urls = {
    'mobilenet_v2': 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth',
}


def conv_bn(inp, oup, stride):
    conv_3x3 = nn.Sequential(
        nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
        nn.BatchNorm2d(oup),
        nn.ReLU(inplace=True)
    )
    for m in conv_3x3.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
    return conv_3x3


def conv_1x1_bn(inp, oup):
    conv1x1 = nn.Sequential(
        nn.Conv2d(inp, oup, 1, 1, 0, bias=False),
        nn.BatchNorm2d(oup),
        nn.ReLU(inplace=True))
    for m in conv1x1.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
    return conv1x1


def deconv_bn_relu(in_channels, out_channels, kernel_size, padding, output_padding, bias):
    deconv = nn.Sequential(
        nn.ConvTranspose2d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=2,
            padding=padding,
            output_padding=output_padding,
            bias=bias),
        nn.BatchNorm2d(out_channels, momentum=BN_MOMENTUM),
        nn.ReLU(inplace=True)
    )
    for m in deconv.modules():
        if isinstance(m, nn.ConvTranspose2d):
            fill_up_weights(m)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
    return deconv


def fill_fc_weights(layers):
    for m in layers.modules():
        if isinstance(m, nn.Conv2d):
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)


def fill_up_weights(up):
    w = up.weight.data
    f = math.ceil(w.size(2) / 2)
    c = (2 * f - 1 - f % 2) / (2. * f)
    for i in range(w.size(2)):
        for j in range(w.size(3)):
            w[0, 0, i, j] = \
                (1 - math.fabs(i / f - c)) * (1 - math.fabs(j / f - c))
    for c in range(1, w.size(0)):
        w[c, 0, :, :] = w[0, 0, :, :]


class InvertedResidual(nn.Module):
    def __init__(self, inp, oup, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        self.stride = stride
        assert stride in [1, 2]

        hidden_dim = round(inp * expand_ratio)
        self.use_res_connect = self.stride == 1 and inp == oup

        if expand_ratio == 1:
            self.conv = nn.Sequential(
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )
        else:
            self.conv = nn.Sequential(
                # pw
                nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, 3, stride, 1, groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.ReLU(inplace=True),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )

    def forward(self, x):
        if self.use_res_connect:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetv2Det(nn.Module):
    def __init__(self, heads, head_conv, width_mult=1., is_train=True):
        super(MobileNetv2Det, self).__init__()
        self.inplanes = 32
        self.last_channel = 64  # backbone
        self.deconv_with_bias = False
        self.is_train = is_train
        self.heads = heads

        block = InvertedResidual
        interverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],
            [6, 32, 3, 2],
            [6, 64, 4, 2],
            [6, 96, 3, 1],
            [6, 160, 3, 2],
            [6, 320, 1, 1],
        ]

        # build backbone
        # building first layer
        # assert input_size % 32 == 0
        input_channel = int(self.inplanes * width_mult)
        self.last_channel = int(self.last_channel * width_mult) if width_mult > 1.0 else self.last_channel
        self.features = [conv_bn(3, input_channel, 2)]
        # building inverted residual blocks
        for t, c, n, s in interverted_residual_setting:
            output_channel = int(c * width_mult)
            for i in range(n):
                if i == 0:
                    self.features.append(block(input_channel, output_channel, s, expand_ratio=t))
                else:
                    self.features.append(block(input_channel, output_channel, 1, expand_ratio=t))
                input_channel = output_channel

        # make it nn.Sequential
        self.features = nn.Sequential(*self.features)

        # building last several layers
        self.backbone_lastlayer = conv_1x1_bn(input_channel, self.last_channel)

        self.ups = []
        for i in range(3):
            up = deconv_bn_relu(self.last_channel, self.last_channel, 2, 0, 0, self.deconv_with_bias)
            self.ups.append(up)
        self.ups = nn.Sequential(*self.ups)

        self.conv_dim_matchs = []
        self.conv_dim_matchs.append(conv_1x1_bn(96, self.last_channel))
        self.conv_dim_matchs.append(conv_1x1_bn(32, self.last_channel))
        self.conv_dim_matchs.append(conv_1x1_bn(24, self.last_channel))
        self.conv_dim_matchs = nn.Sequential(*self.conv_dim_matchs)

        self.last_context_conv = conv_bn(self.last_channel, self.last_channel, 1)

        for head in sorted(self.heads):
            num_output = self.heads[head]


            if head_conv > 0:
                fc = nn.Sequential(
                    nn.Conv2d(self.last_channel, head_conv,
                              kernel_size=3, padding=1, bias=True),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(head_conv, num_output,
                              kernel_size=1, stride=1, padding=0))
                if 'hm' in head:
                    fc[-1].bias.data.fill_(-2.19)
                else:
                    fill_fc_weights(fc)
            else:
                fc = nn.Conv2d(
                    in_channels=self.last_channel,
                    out_channels=num_output,
                    kernel_size=1,
                    stride=1,
                    padding=0
                )
                if 'hm' in head:
                    fc.bias.data.fill_(-2.19)
                else:
                    fill_fc_weights(fc)
            self.__setattr__(head, fc)

    def init_weights(self, pretrained=True):
        if pretrained:
            for head in self.heads:
                final_layer = self.__getattr__(head)
                for i, m in enumerate(final_layer.modules()):
                    if isinstance(m, nn.Conv2d):
                        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                        if m.weight.shape[0] == self.heads[head]:
                            if 'hm' in head:
                                nn.init.constant_(m.bias, -2.19)
                            else:
                                nn.init.normal_(m.weight, std=0.001)
                                nn.init.constant_(m.bias, 0)
            url = model_urls['mobilenet_v2']
            pretrained_state_dict = model_zoo.load_url(url)
            print('=> loading pretrained model {}'.format(url))
            self.features.load_state_dict(pretrained_state_dict, strict=False)
        else:
            for m in self.modules():
                if isinstance(m, nn.Conv2d):
                    nn.init.kaiming_normal_(m.weight, mode='fan_out')
                    if m.bias is not None:
                        nn.init.zeros_(m.bias)
                elif isinstance(m, nn.BatchNorm2d):
                    nn.init.ones_(m.weight)
                    nn.init.zeros_(m.bias)
                elif isinstance(m, nn.Linear):
                    nn.init.normal_(m.weight, 0, 0.01)
                    nn.init.zeros_(m.bias)
                elif isinstance(m, nn.ConvTranspose2d):
                    fill_up_weights(m)
                    if m.bias is not None:
                        nn.init.zeros_(m.bias)

    def forward(self, x):
        xs = []

        for n in range(0, 4):
            x = self.features[n](x)
        xs.append(x)

        for n in range(4, 7):
            x = self.features[n](x)
        xs.append(x)

        for n in range(7, 14):
            x = self.features[n](x)
        xs.append(x)

        for n in range(14, 18):
            x = self.features[n](x)

        x = self.backbone_lastlayer(x)

        for i in range(3):
            x = self.ups[i](x)
            x = x + self.conv_dim_matchs[i](xs[3 - i - 1])

        x = self.last_context_conv(x)
        if self.is_train == True:
            ret = {}
            for head in self.heads:
                ret[head] = self.__getattr__(head)(x)
            return [ret]
        else:
            ret = []
            for head in self.heads:
                ret.append(self.__getattr__(head)(x))
            return torch.cat(ret, 1)


def get_mv2relu_net(num_layers, heads, head_conv, is_train):
    model = MobileNetv2Det(heads, head_conv=head_conv, width_mult=1.0, is_train=is_train)
    model.init_weights()
    return model


def create_model(heads={'hm': 1, 'wh': 2, 'reg': 2}, head_conv=64, is_train=True):
    model = get_mv2relu_net(num_layers=0, heads=heads, head_conv=head_conv, is_train=is_train)
    return model


def demo():
    model = create_model(is_train=False).cuda()
    model.init_weights()

    model = model.eval()
    model.is_train = False
    model.eval()

    random_sample = torch.rand(2, 3, 640, 640).cuda()
    out = model(random_sample)


    traced_module = torch.jit.trace(model, random_sample)
    traced_module.save("traced_model.pt")


if __name__ == '__main__':
    demo()

I cannot trace the model and get:

RuntimeError: Only tensors, lists, tuples of tensors, or dictionary of tensors can be output from traced functions

and scripting it yields:

RuntimeError: 
Expected integer literal for index:
  File "<ipython-input-194-4d44670bb81d>", line 254
        
        for n in range(0, 4):
            x = self.features[n](x)
                ~~~~~~~~~~~~~~~~ <--- HERE
        xs.append(x)

I am sorry, my bad @ptrblck. I have edited the code; it should work now.

Thanks, the model timing is also stable for this model:

Iteration : 0 Elapsed time 352.787
Iteration : 100 Elapsed time 6.162
Iteration : 200 Elapsed time 6.17
Iteration : 300 Elapsed time 6.19
Iteration : 400 Elapsed time 6.197
Iteration : 500 Elapsed time 6.204
Iteration : 600 Elapsed time 6.169
Iteration : 700 Elapsed time 6.193
Iteration : 800 Elapsed time 6.168
Iteration : 900 Elapsed time 6.128
Iteration : 1000 Elapsed time 6.209
Iteration : 1100 Elapsed time 6.183
Iteration : 1200 Elapsed time 6.17
Iteration : 1300 Elapsed time 6.18
Iteration : 1400 Elapsed time 6.181
Iteration : 1500 Elapsed time 6.177
Iteration : 1600 Elapsed time 6.175
Iteration : 1700 Elapsed time 6.186
Iteration : 1800 Elapsed time 6.185
Iteration : 1900 Elapsed time 6.18
Iteration : 2000 Elapsed time 6.136
Iteration : 2100 Elapsed time 6.192
Iteration : 2200 Elapsed time 6.158
Iteration : 2300 Elapsed time 6.15
Iteration : 2400 Elapsed time 6.14
Iteration : 2500 Elapsed time 6.166
Iteration : 2600 Elapsed time 6.169
Iteration : 2700 Elapsed time 6.178
Iteration : 2800 Elapsed time 6.177
Iteration : 2900 Elapsed time 6.173
Iteration : 3000 Elapsed time 6.166

(besides the previously mentioned wrongly timed first iteration).
