I am implementing the MobileNetV3 paper and am running into an error.

In the paper, Figure 10 shows that they apply average pooling after the stride-16 stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class _LRASPP(nn.Module):
    """Lite R-ASPP"""

    def __init__(self, in_channels, norm_layer, **kwargs):
        super(_LRASPP, self).__init__()
        out_channels = 128
        self.b0 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            norm_layer(out_channels),
            nn.ReLU(True)
        )
        self.b1 = nn.Sequential(
            nn.AvgPool2d(kernel_size=(49, 49), stride=(16, 20)),  # check it
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        size = x.size()[2:]
        feat1 = self.b0(x)
        feat2 = self.b1(x)
        feat2 = F.interpolate(feat2, size, mode='bilinear', align_corners=True)
        x = feat1 * feat2  # check it
        return x


if __name__ == '__main__':
    net = _LRASPP(in_channels=576, norm_layer=nn.BatchNorm2d)
    input_size = (1, 576, 64, 32)
    net = net.cuda()
    x = torch.randn(input_size)
    x = x.cuda()
    net.eval()
    for i in range(10):
        out = net(x)
        print(out.shape)
```

However, I get the following error:

```
RuntimeError: Given input size: (576x64x32). Calculated output size: (576x1x0). Output size is too small
```

In the paper they use a 1024 x 512 input, which means the feature map after stride 16 is 64 x 32.
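The arithmetic behind the error can be checked with the `AvgPool2d` output-size formula, `floor((in - kernel) / stride) + 1` (with zero padding), applied to the 64 x 32 map:

```python
import math

def avg_pool_out(in_size, kernel, stride):
    # AvgPool2d output size with padding=0: floor((in - kernel) / stride) + 1
    return math.floor((in_size - kernel) / stride) + 1

# Feature map after stride 16 on a 1024 x 512 input: 64 x 32
h = avg_pool_out(64, kernel=49, stride=16)  # floor(15/16) + 1 = 1
w = avg_pool_out(32, kernel=49, stride=20)  # floor(-17/20) + 1 = 0
print(h, w)  # 1 0 -> matches "Calculated output size: (576x1x0)"
```

The width (32) is smaller than the 49-pixel kernel, so the computed output width is 0, which is exactly what the error reports.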

My question is: how can the 49 x 49 average pooling be applied to a 64 x 32 feature map?

If I use stride 8 I can apply it, of course, but the paper says they use stride 16 or stride 32, with or without LR-ASPP, so it does not make sense to me. There is no extra information about this either. Any ideas?
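One common workaround (and, if I remember correctly, what torchvision's LR-ASPP head does) is to replace the fixed 49 x 49 pool with a global average pool, which is valid for any input size. A sketch on CPU, keeping the rest of the module unchanged; the class name is mine, not from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LRASPPGlobalPool(nn.Module):
    """Lite R-ASPP head with a global average pool in the scale branch."""

    def __init__(self, in_channels, norm_layer=nn.BatchNorm2d, out_channels=128):
        super().__init__()
        self.b0 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            norm_layer(out_channels),
            nn.ReLU(True),
        )
        self.b1 = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global pool: 1 x 1 output for any input size
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        size = x.size()[2:]
        feat1 = self.b0(x)
        feat2 = self.b1(x)
        feat2 = F.interpolate(feat2, size, mode='bilinear', align_corners=True)
        return feat1 * feat2


net = LRASPPGlobalPool(in_channels=576).eval()
x = torch.randn(1, 576, 64, 32)
with torch.no_grad():
    out = net(x)
print(out.shape)  # torch.Size([1, 128, 64, 32])
```

This no longer matches the paper's 49 x 49 / stride (16, 20) description literally, but it keeps the "pool, 1x1 conv, sigmoid, upsample, multiply" structure and works for the 64 x 32 map.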

I tested PyTorch 1.0.0 and this code works there, but on version 1.6 it raises the error above.

The pooling operation still has this issue.
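If you want to keep a large local pooling window rather than a global one, another option (my own workaround, not from the paper) is to clamp the kernel to the input size at runtime with `F.avg_pool2d`:

```python
import torch
import torch.nn.functional as F

def safe_avg_pool(x, kernel=(49, 49), stride=(16, 20)):
    # Clamp the kernel so it never exceeds the spatial size of x;
    # this avoids "Output size is too small" on small feature maps.
    h, w = x.shape[2:]
    k = (min(kernel[0], h), min(kernel[1], w))
    return F.avg_pool2d(x, kernel_size=k, stride=stride)

x = torch.randn(1, 576, 64, 32)
y = safe_avg_pool(x)
# kernel clamps to (49, 32): floor((64-49)/16)+1 = 1, floor((32-32)/20)+1 = 1
print(y.shape)  # torch.Size([1, 576, 1, 1])
```

This would go in `forward` in place of the `nn.AvgPool2d` module, since the kernel now depends on the input's spatial size.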