Self-defined model vs. torchvision.models

I manually defined a DenseNet as follows; the core code is taken from torchvision.models.densenet121, and everything else is kept the same. My DenseNet definition:

from collections import OrderedDict

import torch.nn as nn
import torch.nn.functional as F
# _DenseBlock and _Transition are reused unchanged from torchvision
from torchvision.models.densenet import _DenseBlock, _Transition


class DenseNet(nn.Module):
    r"""Densenet-BC model class, based on
    `"Densely Connected Convolutional Networks" <https://arxiv.org/pdf/1608.06993.pdf>`_

    Args:
        growth_rate (int) - how many filters to add each layer (`k` in paper)
        block_config (list of 4 ints) - how many layers in each pooling block
        num_init_features (int) - the number of filters to learn in the first convolution layer
        bn_size (int) - multiplicative factor for number of bottle neck layers
          (i.e. bn_size * k features in the bottleneck layer)
        drop_rate (float) - dropout rate after each dense layer
        num_classes (int) - number of classification classes
    """

    def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16),
                 num_init_features=64, bn_size=4, drop_rate=0, num_classes=198):

        super(DenseNet, self).__init__()

        # First convolution
        self.features = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2, padding=3, bias=False)),
            ('norm0', nn.BatchNorm2d(num_init_features)),
            ('relu0', nn.ReLU(inplace=True)),
            ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1)),
        ]))

        # Each denseblock
        num_features = num_init_features
        for i, num_layers in enumerate(block_config):
            block = _DenseBlock(num_layers=num_layers, num_input_features=num_features,
                                bn_size=bn_size, growth_rate=growth_rate, drop_rate=drop_rate)
            self.features.add_module('denseblock%d' % (i + 1), block)
            num_features = num_features + num_layers * growth_rate
            if i != len(block_config) - 1:
                trans = _Transition(num_input_features=num_features, num_output_features=num_features // 2)
                self.features.add_module('transition%d' % (i + 1), trans)
                num_features = num_features // 2

        # Final batch norm
        self.features.add_module('norm5', nn.BatchNorm2d(num_features))

        # Linear layer
        self.classifier = nn.Linear(num_features, num_classes)

        # Official init from torch repo.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        features = self.features(x)
        out = F.relu(features, inplace=True)
        # Global average pool, then flatten to (N, num_features)
        feat = F.adaptive_avg_pool2d(out, (1, 1)).view(features.size(0), -1)
        out = self.classifier(feat)

        # feat is already (N, 1024) with the default config, so this extra
        # view is redundant here (and would fail for other block configs)
        return feat.view(-1, 1024), out

Following the transfer-learning chapter of the PyTorch tutorials, I also tried this, without ImageNet weights:

import torch.nn as nn
from torchvision import models

densenet121 = models.densenet121(pretrained=False)  # no ImageNet weights
num_ftrs = densenet121.classifier.in_features       # 1024 for densenet121
densenet121.classifier = nn.Linear(num_ftrs, 198)   # replace the 1000-class head

After training, I get two quite different results: the torchvision version reaches 60%+ accuracy, while my own definition only reaches 20%+. I don't understand this, since they should be the same model. Is there anything wrong in my implementation?
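One way to rule out an architectural mismatch is to load the torchvision model's state_dict into the custom class and compare outputs on the same input. A minimal sketch, assuming the DenseNet class above is in scope (strict=True raises if any parameter name or shape differs):

import torch
from torchvision import models

custom = DenseNet(num_classes=198)
reference = models.densenet121(pretrained=False)
reference.classifier = torch.nn.Linear(reference.classifier.in_features, 198)

# If the two definitions really are the same network, this succeeds...
custom.load_state_dict(reference.state_dict(), strict=True)

# ...and with identical weights the logits must match on the same input
custom.eval()
reference.eval()
x = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    _, logits = custom(x)  # the custom forward returns (features, logits)
    print(torch.allclose(logits, reference(x), atol=1e-5))

If this check passes, the gap is more likely to come from the training setup (which output is fed to the loss, optimizer, data pipeline) than from the architecture itself.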

Why don't you just train your model on a small subset of the dataset and try to reach 100% accuracy? If it gets there, you probably don't have bugs in the forward and backward pass.

Thanks for replying. The SOTA on this benchmark is about 63%, so there is no need to conduct extra experiments on a smaller dataset.

I mean: try to detect a buggy implementation by training on a small subset of the dataset. If the result reaches 100% accuracy, then everything works fine with respect to the forward and backward pass (see the sketch below).
Reproducing published results is a different matter: you need to tune the model hyperparameters. My advice is to start with the learning rate.
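A minimal sketch of that sanity check, using a synthetic TensorDataset as a hypothetical stand-in for a small slice of the real 198-class dataset:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# 32 random images with random labels: a healthy network should memorize these
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 198, (32,))
loader = DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=True)

model = DenseNet(num_classes=198)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    correct = 0
    for x, y in loader:
        optimizer.zero_grad()
        _, logits = model(x)  # the custom forward returns (features, logits)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        correct += (logits.argmax(dim=1) == y).sum().item()
    if correct == len(labels):
        print(f'reached 100% train accuracy at epoch {epoch}')
        break

If the loop never reaches 100%, the bug is in the model or the training step, not in the hyperparameters.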

What do the two accuracies refer to?
Is the 60% accuracy from the torchvision model and the 20% from your custom implementation?

Are these results reproducible when you fix the random seeds?
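For reference, a typical seeding setup for checking that; set_seed is a hypothetical helper, and the cuDNN flags trade speed for determinism (full determinism on GPU may additionally need torch.use_deterministic_algorithms(True)):

import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    """Fix all common RNG sources so repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)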