Why is the inference speed divided by 3 when the batch size is increased by one?

Hello,
I am currently using PyTorch (1.6.0) to build a graph convolutional network, and I ran into an issue at inference time: with a batch size of 1, the model runs at around 180 it/s, but with a batch size of 2 it drops to about 60 it/s! I can't find the origin of this behaviour (the allocated GPU memory stays about the same during the whole process). Is this kind of change to be expected? How can it be improved?

By the way, the script I use to measure the speed is the following:

num_kp = 18
for batch_size in [1, 2]:
    mod = gcn.STGCN(gr, use_attention=False, use_tem=False, num_kpts=num_kp, num_people=num_p).eval().cuda()
    model_input = torch.rand((batch_size, 3, 150, num_kp, 1)).cuda()

    with torch.no_grad():
        for _ in tqdm.tqdm(range(100)):
            mod(model_input)

Thanks !

CUDA operations are asynchronous, so if you want to measure the forward pass you would have to synchronize the code before starting and stopping a timer.
Currently you might be profiling the PyTorch overhead and the kernel launches.
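
As a minimal sketch of what synchronized timing could look like (the helper name `time_forward`, the warm-up count, and the placeholder `nn.Conv2d` model are assumptions for illustration, not part of the original script):

```python
import time

import torch
import torch.nn as nn


def time_forward(model, x, n_warmup=10, n_iters=100):
    """Return the mean forward-pass time in seconds (hypothetical helper)."""
    use_cuda = x.is_cuda
    with torch.no_grad():
        # Warm-up: the first calls include one-off costs such as CUDA
        # initialization and memory allocation, so they are not timed.
        for _ in range(n_warmup):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for pending kernels before starting the timer
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the timed kernels before stopping the timer
        elapsed = time.perf_counter() - start
    return elapsed / n_iters


device = "cuda" if torch.cuda.is_available() else "cpu"
# Placeholder model (assumption); substitute the real network here.
model = nn.Conv2d(3, 64, kernel_size=1).eval().to(device)
for batch_size in [1, 2]:
    x = torch.rand(batch_size, 3, 150, 18, device=device)
    print(batch_size, time_forward(model, x))
```

Synchronizing once around the whole timed loop, rather than on every iteration, keeps the synchronization cost itself out of the per-iteration average.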

Alright, this time I measured the forward pass time using torch.cuda.synchronize():

mean_time = 0
for _ in tqdm.tqdm(range(1000)):
    torch.cuda.synchronize()
    start_t = time.time()
    mod(model_input)
    torch.cuda.synchronize()
    end_t = time.time()
    mean_time += (end_t - start_t) / 1000

The results are indeed not as bad as before, but the time still goes from 5 ms to 12 ms between batch sizes 1 and 2. The difference is still big enough to make me wonder whether this is to be expected. Do you think this issue is related to my code, or is it CUDA-related?
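
For reference, the figures reported in this thread can be converted to per-sample throughput with a small calculation (the helper name `samples_per_second` is only illustrative):

```python
def samples_per_second(batch_size, seconds_per_iter):
    """Throughput in samples per second for one forward pass."""
    return batch_size / seconds_per_iter


# Synchronized timings reported above: 5 ms (batch size 1), 12 ms (batch size 2).
bs1 = samples_per_second(1, 0.005)
bs2 = samples_per_second(2, 0.012)
print(f"{bs1:.0f} samples/s vs {bs2:.0f} samples/s")  # 200 samples/s vs 167 samples/s

# First, unsynchronized measurement: 180 it/s and 60 it/s.
print(180 * 1, "samples/s vs", 60 * 2, "samples/s")  # 180 samples/s vs 120 samples/s
```

In per-sample terms, the synchronized measurement goes from 5 ms to 6 ms per sample, so the slowdown is smaller than the raw iteration times suggest, though the batch of 2 is still less efficient per sample.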

I don’t know. Could you post your model as well as the input shapes, so that I could have a look?

Sure! The code is quite long, so I won’t post the whole thing here:

  1. The main model class
class STGCN(nn.Module):
    """Spatio temporal graph convolutional network
    """
    def __init__(self, graph, num_kpts=18, in_features=3, num_classes=20, num_people=2, use_attention=True, use_tem=True):
        super(STGCN, self).__init__()
        if isinstance(graph, tuple):
            self.adj = torch.cat([get_normalized_adj(subgraph).unsqueeze(0) for subgraph in graph]).cuda()
        else:
            self.adj = get_normalized_adj(graph).cuda()

        init_filters = 64
        self.b1 = STGC_Block(self.adj, in_features, init_filters, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem)
        self.b2 = STGC_Block(self.adj, init_filters, init_filters, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem)
        self.b3 = STGC_Block(self.adj, init_filters, init_filters, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem)
        self.b4 = STGC_Block(self.adj, init_filters, init_filters * 2, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem, stride=2)
        self.b5 = STGC_Block(self.adj, init_filters * 2, init_filters * 2, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem)
        self.b6 = STGC_Block(self.adj, init_filters * 2, init_filters * 2, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem)
        self.b7 = STGC_Block(self.adj, init_filters * 2, init_filters * 4, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem, stride=2)
        self.b8 = STGC_Block(self.adj, init_filters * 4, init_filters * 4, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem)
        self.b9 = STGC_Block(self.adj, init_filters * 4, init_filters * 4, num_kpts=num_kpts, use_attention=use_attention, use_tem=use_tem)

        self.bn = nn.BatchNorm1d(num_people * in_features * num_kpts)
        self.classify = nn.Linear(256, num_classes)
        # self.softmax = nn.Softmax(-1)
        self.drop = nn.Dropout(0.3)

    def forward(self, x):
        batch_size, in_feats, seq_len, num_kpts, num_people = x.size()

        x = x.permute(0, 4, 3, 1, 2).contiguous().view(batch_size, num_people * num_kpts * in_feats, seq_len)
        x = self.bn(x)
        x = x.view(batch_size, num_people, num_kpts, in_feats, seq_len)
        x = x.permute(0, 1, 3, 4, 2).contiguous().view(batch_size * num_people, in_feats, seq_len, num_kpts)

        x = self.b1(x)
        x = self.b2(x)
        x = self.b3(x)
        x = self.b4(x)
        x = self.b5(x)
        x = self.b6(x)
        x = self.b7(x)
        x = self.b8(x)
        x = self.b9(x)

        x = x.view(batch_size, num_people, 256, -1)  # shape: batch x peops x 256 x (seq * kpts)
        x = x.mean(-1).mean(1)
        x = self.drop(x)

        return self.classify(x)
  2. The STGC blocks:
class STGC_Block(nn.Module):
    """ST-GCN like model basic block:
        - A graph convolution
        - Optional : an attention layer
        - A temporal convolution
    """
    def __init__(self, adj, in_features, out_features, num_kpts, stride=1, temporal_kernel=9, use_attention=True, use_tem=True):
        super(STGC_Block, self).__init__()
        self.adj = adj
        self.graph_conv = AdaptiveGraphConv(adj, in_features, in_features, out_features, num_kpts)
        self.temp_conv = TemporalConv(out_features, temporal_kernel=temporal_kernel, stride=stride)

        # RESIDUAL CONNECTION
        if in_features != out_features and stride == 1:
            self.residual_connection = nn.Sequential(
                nn.Conv2d(in_features, out_features, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_features)
            )
        elif in_features == out_features and stride == 1:
            self.residual_connection = lambda input_features: input_features
        else:
            self.residual_connection = TemporalConv(in_features, out_features, temporal_kernel, stride)

        self.relu = nn.ReLU(inplace=True)

    def forward(self, features):
        x = self.graph_conv(features)

        x = self.temp_conv(x)

        x += self.residual_connection(features)
        x = self.relu(x)
        return x

The adaptive graph conv is basically made of two torch.matmul calls and an nn.Conv2d (kernel size 1), whereas the temporal conv consists only of an nn.Conv2d layer (kernel size 9x1).

The input I used to measure the timings is a torch.rand tensor with dimensions (batch_size, 3, 150, 18, 1).

graph and get_normalized_adj are undefined, so I cannot run the code, unfortunately.
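
Until the real definitions are posted, one way to get something runnable might be a stand-in like the sketch below. Both the chain-shaped `graph` and this version of `get_normalized_adj` (symmetric normalization with self-loops) are assumptions made for illustration; the original implementations may differ.

```python
import torch


def get_normalized_adj(graph):
    """Hypothetical stand-in: symmetric normalization D^-1/2 (A + I) D^-1/2."""
    adj = torch.as_tensor(graph, dtype=torch.float32)
    adj = adj + torch.eye(adj.size(0))       # add self-loops
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)  # degrees are >= 1 thanks to the self-loops
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)


# Hypothetical graph: a simple undirected chain over the 18 keypoints.
num_kp = 18
graph = torch.zeros(num_kp, num_kp)
idx = torch.arange(num_kp - 1)
graph[idx, idx + 1] = 1.0
graph[idx + 1, idx] = 1.0

adj = get_normalized_adj(graph)
print(adj.shape)  # torch.Size([18, 18])
```

This only covers the single-graph branch of the constructor; the tuple-of-subgraphs branch would need one such matrix per subgraph.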