Need help with my depthwise/pointwise convolution code

Hello, nice to meet you all.
I am currently trying to write a PyTorch version of the SDD crack segmentation network.
The paper describes applying the pointwise convolution first and the depthwise convolution after it.
I wrote the block as shown below.

> class depthwise_separable_convs(nn.Module):
>     def __init__(self, nin=64, nout=64, kernel_size, padding, bias=False):
>         super(depthwise_separable_convs, self).__init__()
>         d=64
>         pw_filter_nums=int(d/2)
>         self.pointwise = nn.Sequential(
>         #pointwise
>         nn.Conv2d(nin, pw_filter_nums, kernel_size=3,stride=1, padding=1, groups=pw_filter_nums, bias=bias), 
>         nn.BatchNorm2d(pw_filter_nums),
>         nn.ReLU(inplace=True),
>         #depthwise
>         nn.Conv2d(pw_filter_nums, nout, kernel_size=3, padding=padding, groups=pw_filter_nums, bias=bias),
>         nn.BatchNorm2d(nout),
>         nn.ReLU(inplace=True))
>         
>     def forward(self, x):
>         out = self.pointwise(x)
>         
>         return out

I followed some earlier threads in this community, but I am not sure that I set groups correctly for the depthwise and pointwise convolutions.
I am fairly sure there is an error in my code, because the training loss does not decrease.
Could anyone give me some advice on what is wrong? I do not fully understand how groups works. My training code should not be the problem: I tested it with U-Net and DeepLab v3 and it works very well.
Thank you for reading my question.

If I'm not mistaken, a depthwise separable convolution applies a grouped convolution followed by a pointwise convolution, as shown here.
Both of your convolutions use a kernel size of 3 (the pointwise one should use a 1x1 kernel), and both set groups to values that do not match their roles (the depthwise one should use groups=in_channels).
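To make the kernel size and groups settings concrete, here is a minimal sketch of the standard depthwise-then-pointwise order. The class name DepthwiseSeparableConv and the channel counts are made up for illustration and are not taken from the paper or from the code above.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""

    def __init__(self, in_channels, out_channels, bias=False):
        super().__init__()
        # Depthwise: groups=in_channels gives every input channel its own 3x3 filter.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=bias)
        # Pointwise: a 1x1 kernel with the default groups=1 mixes all channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=bias)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


# Illustrative sizes only.
block = DepthwiseSeparableConv(in_channels=64, out_channels=32)
x = torch.randn(2, 64, 16, 16)
y = block(x)

# Conv2d weights have shape (out_channels, in_channels // groups, kH, kW).
print(block.depthwise.weight.shape)  # torch.Size([64, 1, 3, 3])
print(block.pointwise.weight.shape)  # torch.Size([32, 64, 1, 1])
print(y.shape)                       # torch.Size([2, 32, 16, 16])
```

The weight shapes are a quick way to check groups: a depthwise layer should show 1 in the second dimension, and a pointwise layer should show 1x1 in the last two.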

Thank you for answering.
I updated the code after searching a bit more.
The paper whose network I am trying to reimplement applies the pointwise convolution first rather than the depthwise one.
I also checked that MobileNetV2 uses this order in some parts.
Anyway, it looks like the block is not what is preventing the network from updating.
So I am now digging into the weight initialization. I am not sure whether I am missing something.

Thank you.

self.pointwise = nn.Sequential(
    # pointwise: 1x1 kernel, groups=1 (default)
    nn.Conv2d(nin, pw, kernel_size=1, stride=1, padding=0, bias=bias),
    nn.BatchNorm2d(pw),
    nn.ReLU(inplace=True),
    # depthwise: 3x3 kernel, one filter per channel
    nn.Conv2d(pw, pw, kernel_size=3, stride=1, padding=1, groups=pw, bias=bias),
    nn.BatchNorm2d(pw),
    nn.ReLU(inplace=True),
)
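For anyone who wants to test the pointwise-first block in isolation, below is a rough, self-contained sketch that wraps it in a module, applies an explicit weight initialization, and checks that every parameter receives a gradient. The class name PointwiseThenDepthwise, the default channel counts, and the Kaiming-normal initialization are assumptions for illustration; the paper's actual initialization is not stated in this thread.

```python
import torch
import torch.nn as nn


class PointwiseThenDepthwise(nn.Module):
    """1x1 pointwise conv first, then a 3x3 depthwise conv (hypothetical name)."""

    def __init__(self, nin=64, pw=32, bias=False):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(nin, pw, kernel_size=1, bias=bias),
            nn.BatchNorm2d(pw),
            nn.ReLU(inplace=True),
            nn.Conv2d(pw, pw, kernel_size=3, padding=1, groups=pw, bias=bias),
            nn.BatchNorm2d(pw),
            nn.ReLU(inplace=True),
        )
        # Assumed init scheme (not taken from the paper): Kaiming-normal for convs.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.block(x)


torch.manual_seed(0)
model = PointwiseThenDepthwise()
out = model(torch.randn(4, 64, 32, 32))
out.mean().backward()

# Parameters whose gradient is missing or exactly zero cannot be updated.
dead = [name for name, p in model.named_parameters()
        if p.grad is None or p.grad.abs().sum().item() == 0.0]
print(out.shape)
print("parameters without gradient:", dead)
```

If the list printed at the end is empty for this block alone, the reason the loss does not decrease is more likely elsewhere in the network, for example in how the blocks are connected or in the loss and learning rate.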