How to replace a 3D convolution with 2D convolutions?

I am implementing the idea from the paper “A Closer Look at Spatiotemporal Convolutions for Action Recognition”. It proposes replacing 3D convolution with R(2+1)D convolution, which is implemented in Caffe2. My goal is to reproduce the result in PyTorch. A 3D convolution takes an input of size 3xtxhxw, where 3 is the number of RGB channels, t is the number of frames, and h and w are the height and width. R(2+1)D does this in two steps:

  1. Convolve with a 1xdxd kernel (d is the spatial kernel size; 1 means it acts on a single frame).
  2. Apply a tx1x1 convolution to the resulting feature map.

In PyTorch, a 3D convolution can be written as

    self.conv3d = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                            stride=1, padding=1, bias=bias)

This is my implementation, which is meant to be equivalent to the layer above:

    # Step 1: spatial convolution on each frame
    self.spatial_conv = nn.Conv2d(in_channels, intermed_channels, kernel_size=3,
                                  stride=1, padding=1, bias=bias)
    self.bn = nn.BatchNorm2d(intermed_channels)
    self.relu = nn.ReLU()
    # Step 2: temporal convolution across frames
    self.temporal_conv = nn.Conv3d(intermed_channels, out_channels, temporal_kernel_size,
                                   stride=temporal_stride, padding=temporal_padding, bias=bias)
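For comparison, here is a minimal sketch of how I understand the two steps when both are written with `nn.Conv3d`, using a `(1, d, d)` kernel and then a `(t, 1, 1)` kernel. The class name `SpatioTemporalConv`, the use of `BatchNorm3d`, the `d // 2` and `t // 2` padding, and the tensor sizes are my own assumptions for illustration. They are not taken from the paper's Caffe2 code.

```python
import torch
import torch.nn as nn


class SpatioTemporalConv(nn.Module):
    """Hypothetical (2+1)D block: a 1xdxd spatial conv, then a tx1x1 temporal conv."""

    def __init__(self, in_channels, intermed_channels, out_channels,
                 d=3, t=3, bias=False):
        super().__init__()
        # Step 1: 1 x d x d kernel, i.e. a 2D convolution applied to every frame.
        self.spatial_conv = nn.Conv3d(in_channels, intermed_channels,
                                      kernel_size=(1, d, d), stride=1,
                                      padding=(0, d // 2, d // 2), bias=bias)
        self.bn = nn.BatchNorm3d(intermed_channels)
        self.relu = nn.ReLU()
        # Step 2: t x 1 x 1 kernel, i.e. a 1D convolution along the time axis.
        self.temporal_conv = nn.Conv3d(intermed_channels, out_channels,
                                       kernel_size=(t, 1, 1), stride=1,
                                       padding=(t // 2, 0, 0), bias=bias)

    def forward(self, x):
        # x has shape (batch, channels, frames, height, width)
        x = self.relu(self.bn(self.spatial_conv(x)))
        return self.temporal_conv(x)


# Assumed toy input: 2 clips, RGB, 8 frames, 32x32 pixels.
clip = torch.randn(2, 3, 8, 32, 32)
block = SpatioTemporalConv(3, 16, 64)
reference = nn.Conv3d(3, 64, kernel_size=3, stride=1, padding=1)

out_shape = tuple(block(clip).shape)
ref_shape = tuple(reference(clip).shape)
print(out_shape, ref_shape)
```

With these settings the factorized block and the plain 3x3x3 `nn.Conv3d` produce outputs of the same shape. That only checks the shapes, not that the two layers compute the same function.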

Am I correct? Thanks