Why are the results different when the __init__ function in an nn.Module has been modified?

I defined a neural network with __init__ and forward functions:

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.sub1 = Module1()
        self.sub2 = Module2()

When I add some layers, such as Conv2d, into self.sub1 or self.sub2, I find that the performance after one epoch is different!
I can’t figure out the reason for this, could you help me?

Can you share a minimal reproducible example and explain what this “difference” in behavior is?

In addition to what @AlphaBetaGamma96 already asked:
if you are concerned about the reproducibility and determinism of your code, note that a layer's initialization will call into the pseudorandom number generator (PRNG) and will thus change all following calls into it (even if the layer is never used in the forward pass).
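A quick way to see this effect (the layer shapes below are arbitrary and chosen only for illustration):

import torch
import torch.nn as nn

torch.manual_seed(0)
_ = nn.Linear(4, 4)    # initializing a layer draws its weights from the global PRNG
print(torch.randn(3))

torch.manual_seed(0)
_ = nn.Linear(4, 4)
_ = nn.Linear(4, 4)    # one extra, never-used layer consumes additional random numbers
print(torch.randn(3))  # different values than above, despite the identical seed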

Hi, thanks for your reply, I have a simple code example here:

import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
import os

class conv_module(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)

    def forward(self,x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        return x
class mlp_module(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = conv_module()
        self.mlp = mlp_module()

    def forward(self, x):
        x = self.conv(x)
        x = self.mlp(x)
        return x
# seeds
random.seed(int(1024))
os.environ['PYTHONHASHSEED'] = str(1024)
np.random.seed(int(1024))
torch.manual_seed(int(1024))
torch.cuda.manual_seed(int(1024))
torch.cuda.manual_seed_all(int(1024))
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.enabled = False

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# get some random training images
dataiter = iter(trainloader)

net = Net().cuda()

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs.cuda())
        loss = criterion(outputs, labels.cuda())
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')

############# Test ########################
correct = 0
total = 0
# since we're not training, we don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data
        # calculate outputs by running images through the network
        outputs = net(images.cuda())
        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels.cuda()).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct // total} %')

correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images.cuda())
        _, predictions = torch.max(outputs, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels.cuda(), predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1


# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print(f'Accuracy for class: {classname:5s} is {accuracy:.1f} %')

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

This code is modified from the PyTorch tutorials.
When I add a Conv2d layer into conv_module, such as:

class conv_module(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.conv_test = nn.Conv2d(3, 3, 1)  # extra layer, not used in forward

    def forward(self,x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        return x

the performance is different from that of the original model.

Thanks for your reply.
I have posted a simple example for my case above, and there is no custom initialization at all, only the default one from nn.Module.

The “default” initialization will call into .reset_parameters() and then also into the pseudorandom number generator.
As already described, you are changing the PRNG and thus cannot expect to get the same random values without re-seeding.
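As a small illustration (the conv shape is just taken from your model), the parameter values only depend on the PRNG state at the moment of initialization, and .reset_parameters() simply repeats those draws:

import torch
import torch.nn as nn

torch.manual_seed(1024)
layer_a = nn.Conv2d(3, 6, 5)

torch.manual_seed(1024)
layer_b = nn.Conv2d(3, 6, 5)
print(torch.equal(layer_a.weight, layer_b.weight))  # True: same seed, same draws

torch.manual_seed(1024)
layer_b.reset_parameters()                          # re-runs the default initialization
print(torch.equal(layer_a.weight, layer_b.weight))  # still True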

Hi,
Thanks for your reply, which helps me a lot.
But how can I set the seed of the PRNG to ensure the reproducibility of my code?
I have set the NumPy seed, the torch seed, and the CUDA seed in the example, but none of them seems to work.

Best,
Yu

I think the seeds do work, but since you are changing the order of calls into the PRNG you cannot expect to see the same results.
If you want to ignore the additional calls to the newly initialized layers, you could try to re-seed the code.
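For example, in your script one possible place (reusing your seed value of 1024) would be right after the model was created:

net = Net().cuda()

# re-seed after the (possibly modified) model was built, so that everything that
# follows (e.g. the shuffling in the DataLoader) starts from the same PRNG state
# in both variants of the script
torch.manual_seed(1024)
torch.cuda.manual_seed_all(1024)

Note that the parameters of layers created after the additional conv_test (e.g. those in mlp_module) would still differ between the two runs; to align these as well you would have to re-seed before constructing each submodule.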


Sorry, could you specify the meaning of “re-seed”?
Do you mean setting the seeds within the network's __init__ function?
Looking forward to your reply.

I mean you would have to re-seed the script at the point where you can guarantee the same order of calls into the PRNG.

Seeding the PRNG guarantees that the sequence of randomly generated numbers will be equal between runs for the same order of calls into the PRNG.
In your case you are changing the order of calls into the PRNG by initializing additional layers in your script, but expect to see the same random numbers, which is a wrong expectation.

Here is a small example:

# initial seed
torch.manual_seed(seed1)

# init layers in both scripts
layer1 = nn.Linear(...) # will have the same parameters in both scripts as you've seeded the code
layer2 = nn.Linear(...)

# init layer3 only in second script
layer3 = nn.Linear(...) # does not exist in first script!

# continue with random operations in both scripts
x = torch.randn(...) # !!! will NOT yield the same random numbers, since the calls into the PRNG diverged
# if you want to sample the same values, re-seed the code
torch.manual_seed(seed2)
x = torch.randn(...) # will have the same random numbers in both scripts