Model.cuda() results in a different output compared to when not used

I’ve been scratching my head for a while now.
Env: Python 2.7; PyTorch 1.8 + CUDA

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class Model(nn.Module):
    def __init__(self, feature_extractor, dropout=0, pretrained=True, feat_dim=2048):
        super(Model, self).__init__()  # must run before any submodule is assigned
        self.dropout = dropout
        self.feature_extractor = feature_extractor
        self.feature_extractor.avgpool = nn.AdaptiveAvgPool2d(1)
        fe_out_planes = self.feature_extractor.fc.in_features
        self.feature_extractor.fc = nn.Linear(fe_out_planes, feat_dim)
        self.fc_t = nn.Linear(feat_dim, 3)
        self.fc_q = nn.Linear(feat_dim, 3)

        # initialize the model
        if pretrained:
            # keep the pretrained backbone; only re-initialize the new layers
            init_modules = [self.feature_extractor.fc, self.fc_t, self.fc_q]
        else:
            init_modules = self.modules()
        for m in init_modules:
            if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
                nn.init.constant_(m.weight, 0.01)  # constant weights
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    def forward(self, x):
        s = x.size()
        x = x.view(-1, *s[2:])
        x = self.feature_extractor(x)
        x = F.relu(x)
        """if self.dropout > 0:
            x = F.dropout(x, p=self.dropout)"""
        t = self.fc_t(x)
        q = self.fc_q(x)
        out = torch.cat((t, q), 1)
        out = out.view(s[0], s[1], -1)
        return out

feature_extractor = models.resnet34(pretrained=True)
model = Model(feature_extractor, dropout=0, feat_dim=2048)

if torch.cuda.is_available():
    model.cuda()

Now, the interesting part:
When I run,

print('Feed a random batch to test the model: ')
input = torch.ones(1, 64, 3, 7, 7)*0.3
input = input.cuda()
output = model(input)

tensor([[106.7721, 106.7721, 106.7721, 106.7721, 106.7721, 106.7721], … ]], device='cuda:0')

Compared to when I run without model.cuda(), i.e.:

print('Feed a random batch to test the model: ')
input = torch.ones(1, 64, 3, 7, 7)*0.3
output = model(input)

tensor([[53.3860, 53.3860, 53.3860, 53.3860, 53.3860, 53.3860],…])

The CPU values are almost exactly half the GPU values (53.3860 vs. 106.7721), and this ratio holds across different inputs.

I discovered this while porting the original repo to Python 3 (v3.8, with the same version of PyTorch). I compared the outputs of the two versions using the same input data, the same constant weight init, no shuffling, no dropout, and model.eval(), but the outputs still differed.
After many print statements, I tracked it down to the behavior above. Interestingly, this does not happen in the Python 3 version: there, putting the model and data on the GPU with model.cuda() and input = input.cuda() gives the same output as running without .cuda().
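
The CPU-vs-GPU comparison described above can be scripted. The following is only a minimal sketch, not the script from the original repo: the helper name `max_abs_diff_across_devices` and the tiny stand-in network are hypothetical, chosen so the check runs without downloading pretrained weights. It reuses the constant weight init and the 0.3-valued input from the post, and it returns `None` when no CUDA device is present.

```python
import copy

import torch
import torch.nn as nn


def max_abs_diff_across_devices(model, x):
    """Run the same model and input on CPU and GPU; return max |cpu - gpu|.

    Returns None when CUDA is not available, since there is nothing to compare.
    """
    if not torch.cuda.is_available():
        return None
    with torch.no_grad():
        # Work on copies so the caller's model stays on its original device.
        cpu_out = copy.deepcopy(model).cpu().eval()(x.cpu())
        gpu_out = copy.deepcopy(model).cuda().eval()(x.cuda())
    return (cpu_out - gpu_out.cpu()).abs().max().item()


# Hypothetical stand-in network (NOT the resnet34-based Model from the post).
net = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 6),
)

# Same constant initialization as in the post: weights 0.01, biases 0.
for m in net.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.constant_(m.weight, 0.01)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)

x = torch.ones(64, 3, 7, 7) * 0.3
diff = max_abs_diff_across_devices(net, x)
if diff is None:
    print("CUDA not available; nothing to compare")
else:
    print("max abs difference between CPU and GPU outputs: %g" % diff)
```

On a healthy install the difference should be at floating-point noise level. A factor-of-two gap like the one above would show up as a difference of the same order as the outputs themselves.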

I am really confused as to what the issue is and I couldn’t find any relevant documentation regarding this.
Please help.

Thank you.

That’s an interesting finding, but note that PyTorch dropped Python 2.x support when Python 2 reached its end of life (January 2020), so I would recommend updating to Python 3.x.
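
Before comparing the two setups further, it may help to confirm exactly which interpreter and PyTorch build each one is running. Below is a small sketch for that; `environment_report` is a hypothetical helper name, not part of PyTorch, and it only reads attributes PyTorch exposes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`).

```python
import sys

import torch


def environment_report():
    """Collect interpreter and PyTorch build details in a dict."""
    return {
        "python": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "torch": str(torch.__version__),
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


report = environment_report()
for key in sorted(report):
    print("%s: %s" % (key, report[key]))
```

Running this in both the Python 2.7 and the Python 3.8 environments and diffing the output would show whether the two really use the same PyTorch and CUDA builds.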