I am implementing the idea from the paper “A Closer Look at Spatiotemporal Convolutions for Action Recognition”. It proposes replacing 3D convolution with R(2+1)D convolution, which is implemented in Caffe2. My goal is to reproduce the result in PyTorch. A 3D convolution takes an input of size 3xtxhxw, where 3 is the number of RGB channels, t is the number of frames, and h and w are the height and width. R(2+1)D splits the convolution into two steps:
- Convolution with a 1xdxd kernel (d is the spatial kernel size; 1 means it operates on a single frame)
- Apply a tx1x1 convolution to the resulting feature map.
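To make the two steps concrete, here is a minimal sketch of how I understand them (my own illustration, not the paper's Caffe2 code). Both steps are written as nn.Conv3d layers with kernel sizes (1, d, d) and (t, 1, 1). The channel counts, d = 3, t = 3, and the input shape are assumed values chosen only for illustration:

```python
import torch
import torch.nn as nn

# Assumed illustrative sizes, not taken from the paper.
in_channels, intermed_channels, out_channels = 3, 16, 32
d, t = 3, 3

# Step 1: spatial convolution, 1 x d x d (acts on each frame separately).
spatial = nn.Conv3d(in_channels, intermed_channels, kernel_size=(1, d, d),
                    padding=(0, d // 2, d // 2))
# Step 2: temporal convolution, t x 1 x 1 (mixes information across frames only).
temporal = nn.Conv3d(intermed_channels, out_channels, kernel_size=(t, 1, 1),
                     padding=(t // 2, 0, 0))

x = torch.randn(2, in_channels, 8, 32, 32)  # (batch, channels, frames, h, w)
y = temporal(spatial(x))
print(y.shape)  # torch.Size([2, 32, 8, 32, 32])
```

With this padding both steps preserve the number of frames and the spatial size, so only the channel dimension changes.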
In PyTorch, a 3D convolution can be written as:
self.conv3d = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                        stride=1, padding=1, bias=bias)
This is my implementation, which is meant to be equivalent to the above:
self.spatial_conv = nn.Conv2d(in_channels, intermed_channels, kernel_size=3,
                              stride=1, padding=1, bias=bias)
self.bn = nn.BatchNorm2d(intermed_channels)
self.relu = nn.ReLU()
self.temporal_conv = nn.Conv3d(intermed_channels, out_channels, temporal_kernel_size,
stride=temporal_stride, padding=temporal_padding, bias=bias)
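One thing I am unsure about: nn.Conv2d expects a 4D input (N, C, H, W), while a batch of video clips is 5D (N, C, T, H, W). Below is a minimal sketch of how the frames could be folded into the batch axis before the 2D convolution and unfolded again afterwards. All sizes are assumed values for illustration, and this is only one possible way to connect the 2D and 3D layers:

```python
import torch
import torch.nn as nn

n, c, t, h, w = 2, 3, 8, 32, 32   # assumed batch, channels, frames, height, width
intermed_channels = 16            # assumed value
spatial_conv = nn.Conv2d(c, intermed_channels, kernel_size=3, stride=1, padding=1)

x = torch.randn(n, c, t, h, w)
# (N, C, T, H, W) -> (N, T, C, H, W) -> (N*T, C, H, W): each frame becomes a batch item
frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
out = spatial_conv(frames)        # (N*T, C', H, W)
# Restore (N, C', T, H, W) so a temporal nn.Conv3d can follow
out = out.reshape(n, t, intermed_channels, h, w).permute(0, 2, 1, 3, 4)
print(out.shape)  # torch.Size([2, 16, 8, 32, 32])
```

The same folding would apply to nn.BatchNorm2d, since it also expects a 4D input.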
Am I correct? Thanks