Trying to translate an MXNet architecture to PyTorch

Hi all,

I am trying to translate a modified ResNet model architecture written in MXNet to PyTorch. Below is the architecture:

SegmentationNetwork(
(cnn): HybridSequential(
(0): Conv2D(1 → 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(2): Activation(relu)
(3): MaxPool2D(size=(3, 3), stride=(2, 2), padding=(1, 1), ceil_mode=False, global_pool=False, pool_type=max, layout=NCHW)
(4): HybridSequential(
(0): BasicBlockV1(
(body): HybridSequential(
(0): Conv2D(64 → 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(2): Activation(relu)
(3): Conv2D(64 → 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
)
)
(1): BasicBlockV1(
(body): HybridSequential(
(0): Conv2D(64 → 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(2): Activation(relu)
(3): Conv2D(64 → 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
)
)
(2): BasicBlockV1(
(body): HybridSequential(
(0): Conv2D(64 → 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
(2): Activation(relu)
(3): Conv2D(64 → 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=64)
)
)
)
(5): HybridSequential(
(0): BasicBlockV1(
(body): HybridSequential(
(0): Conv2D(64 → 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(2): Activation(relu)
(3): Conv2D(128 → 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)
(downsample): HybridSequential(
(0): Conv2D(64 → 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)
)
(1): BasicBlockV1(
(body): HybridSequential(
(0): Conv2D(128 → 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(2): Activation(relu)
(3): Conv2D(128 → 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)
)
(2): BasicBlockV1(
(body): HybridSequential(
(0): Conv2D(128 → 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(2): Activation(relu)
(3): Conv2D(128 → 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)
)
(3): BasicBlockV1(
(body): HybridSequential(
(0): Conv2D(128 → 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(2): Activation(relu)
(3): Conv2D(128 → 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)
)
)
(6): HybridSequential(
(0): Flatten
(1): Dense(None → 64, Activation(relu))
(2): Dropout(p = 0.5, axes=())
(3): Dense(None → 64, Activation(relu))
(4): Dropout(p = 0.5, axes=())
(5): Dense(None → 4, Activation(sigmoid))
)
)
)
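For anyone attempting the same translation, the stem (layers 0-3 of the printout) maps fairly directly to PyTorch. A minimal sketch, assuming NCHW input as in the printout; note that MXNet's BatchNorm momentum of 0.9 corresponds to PyTorch's momentum of 0.1, since the two frameworks define the running-average momentum in opposite directions:

```python
import torch
import torch.nn as nn

# Sketch of the stem: Conv 1 -> 64, BatchNorm, ReLU, MaxPool.
stem = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False),
    # PyTorch momentum = 1 - MXNet momentum (0.9 in MXNet -> 0.1 here)
    nn.BatchNorm2d(64, eps=1e-5, momentum=0.1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# The input size of 224x224 is illustrative; the stem halves it twice.
x = torch.randn(1, 1, 224, 224)
print(stem(x).shape)  # torch.Size([1, 64, 56, 56])
```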

I can understand most of the architecture, but the last block is quite confusing:

(6): HybridSequential(
(0): Flatten
(1): Dense(None → 64, Activation(relu))
(2): Dropout(p = 0.5, axes=())
(3): Dense(None → 64, Activation(relu))
(4): Dropout(p = 0.5, axes=())
(5): Dense(None → 4, Activation(sigmoid))
)

How do I translate None as an input dimension to PyTorch's nn.Linear? As far as I know it is the input dimension, but how can the input have a None dimension? My other question is about the downsampling part:

(downsample): HybridSequential(
(0): Conv2D(64 → 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)

while the block just before it is:

(body): HybridSequential(
(0): Conv2D(64 → 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(1): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
(2): Activation(relu)
(3): Conv2D(128 → 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
)

How come the input to the downsampling block is not the same as the output of the previous block?
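For context, in a ResNet BasicBlock the downsample path is not applied after the body; it is the shortcut branch, applied to the block's input in parallel with the body. Both branches therefore start from the same 64-channel input, and the 1x1 convolution with stride 2 exists only to match the body's output shape before the residual addition. A minimal PyTorch sketch of this structure (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Sketch of a ResNet BasicBlock with a downsample shortcut."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # The shortcut sees the block's INPUT (64 channels here), not the
        # body's 128-channel output, so it also maps in_ch -> out_ch and
        # uses the same stride as the body's first convolution.
        self.downsample = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual addition: body(x) and downsample(x) both come from x.
        return self.relu(self.body(x) + self.downsample(x))

block = BasicBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```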

And below is the script used to produce the last block:

output.add(gluon.nn.Flatten())
output.add(gluon.nn.Dense(64, activation='relu'))
output.add(gluon.nn.Dropout(p_dropout))
output.add(gluon.nn.Dense(64, activation='relu'))
output.add(gluon.nn.Dropout(p_dropout))
output.add(gluon.nn.Dense(4, activation='sigmoid'))
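Regarding the None input dimension: Gluon's Dense defers shape inference until the first forward pass, which is why the printout shows None. PyTorch offers the same behavior via nn.LazyLinear, so the head above could be sketched as follows (an untested translation; p_dropout = 0.5 is taken from the printed Dropout(p = 0.5)):

```python
import torch
import torch.nn as nn

p_dropout = 0.5  # value from the printed Dropout(p = 0.5)

# nn.LazyLinear infers in_features on the first forward pass, like
# Gluon's Dense(None -> 64). The later layers have known 64-dim inputs,
# so plain nn.Linear works for them.
head = nn.Sequential(
    nn.Flatten(),
    nn.LazyLinear(64), nn.ReLU(),
    nn.Dropout(p_dropout),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Dropout(p_dropout),
    nn.Linear(64, 4), nn.Sigmoid(),
)

# The first call materializes the LazyLinear weight from the flattened size.
out = head(torch.randn(2, 128, 8, 8))
print(out.shape)  # torch.Size([2, 4])
```

Alternatively, if the input image size is fixed, the flattened dimension can be computed once by hand and passed to a regular nn.Linear.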

Please advise.