Add noise to the weights to compute the loss while optimizing the original weights without noise added

I am trying to train a model where I apply a function (here, additive noise) to the current model weights before calculating the loss.
I then want to use this loss to update the original, noise-free weights.
My current attempt is below. I am unsure whether it does what I intend, because the trained model does not behave as if it were optimized when I add the same noise to its weights.


import torch
import torchvision
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("Using device: ", device)

def create_network():
    channel = [784, 100, 100, 10]
    model = nn.Sequential(
        nn.Linear(in_features=channel[0], out_features=channel[1]),
        nn.Linear(in_features=channel[1], out_features=channel[2]),
        nn.Linear(in_features=channel[2], out_features=channel[3]),
    )
    return model

# function to apply a transformation to weights
def transform_weights(model):
    for name, param in model.named_parameters():
        if "weight" in name:
            # create a new random tensor with the same size as the weight tensor
            noise = torch.randn(param.shape, device=param.device) * 0.01
            param.data = param.data + noise
    return model

# Load MNIST dataset
train_dataset = torchvision.datasets.MNIST(
    root="data", train=True, download=True, transform=torchvision.transforms.ToTensor()
)
test_dataset = torchvision.datasets.MNIST(
    root="data", train=False, download=True, transform=torchvision.transforms.ToTensor()
)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataset = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)

model = create_network().to(device)  # the model I want to train, moved to the GPU
model_orig = create_network().to(device)  # stores the weights before noise is added, moved to the GPU
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # only train model
for epoch in range(1, 10):
    for batch_idx, (images, labels) in enumerate(train_loader):
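        images = images.reshape(-1, 28 * 28).to(device)
        labels = labels.to(device)
        # reset gradients
        optimizer.zero_grad()
        # save the clean weights, then drift the weights and compute the forward pass
        model_orig.load_state_dict(model.state_dict())
        model = transform_weights(model)
        loss = criterion(model(images), labels)
        # Run training (backward propagation).
        loss.backward()
        # Load back the original weights
        model.load_state_dict(model_orig.state_dict())
        # Optimize weights.
        optimizer.step()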

        # Calculate the test accuracy
        if batch_idx % 100 == 0:
            correct = 0
            total = 0
            with torch.no_grad():
                for data in test_dataset:
                    images, labels = data
                    images = images.reshape(-1, 28 * 28).to(device)
                    labels = labels.to(device)
                    outputs = model(images)
                    _, predicted = torch.max(outputs, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()
            print(
                f"Epoch: {epoch} Batch: {batch_idx} Loss: {loss.item()} Accuracy: {100 * correct / total}"
            )


Using device:  cuda:0
Epoch: 1 Batch: 0 Loss: 2.3247387409210205 Accuracy: 10.09
Epoch: 1 Batch: 100 Loss: 1.5586031675338745 Accuracy: 64.34
Epoch: 1 Batch: 200 Loss: 0.8242834210395813 Accuracy: 81.4
Epoch: 1 Batch: 300 Loss: 0.6049119234085083 Accuracy: 86.83
Epoch: 1 Batch: 400 Loss: 0.4129831790924072 Accuracy: 87.14
Epoch: 1 Batch: 500 Loss: 0.4107397794723511 Accuracy: 89.9
Epoch: 1 Batch: 600 Loss: 0.36199185252189636 Accuracy: 90.48
Epoch: 1 Batch: 700 Loss: 0.42539575695991516 Accuracy: 91.31
Epoch: 1 Batch: 800 Loss: 0.3088320195674896 Accuracy: 91.61

Hi Atif!

(Note, .data is deprecated and its use is likely to lead to errors.)

I think you are better off not trying to modify Linear.weight. At issue is
that Sequential wants to apply its Modules as-is to its input, but you want
to do something different. So I would recommend defining your own Module
whose forward() method explicitly does what you want – that is, applies
a noisy version of Linear.weight to its input.

It is convenient to use pytorch’s functional version of linear(), torch.nn.functional.linear(), to do this.

I would suggest something like this:

import torch
print (torch.__version__)

_ = torch.manual_seed (2022)

def add_noise (x):
    return  x + 0.01 * torch.randn_like (x)

class NoisyModel (torch.nn.Module):
    def __init__(self, channel):
        super().__init__()   # required before assigning submodules
        self.lin1 = torch.nn.Linear (in_features=channel[0], out_features=channel[1])
        self.lin2 = torch.nn.Linear (in_features=channel[1], out_features=channel[2])
        self.lin3 = torch.nn.Linear (in_features=channel[2], out_features=channel[3])
    def forward (self, x):
        x = x.flatten (start_dim = 1)
        x = torch.nn.functional.linear (x, add_noise (self.lin1.weight), self.lin1.bias)
        x = x.sigmoid()
        x = torch.nn.functional.linear (x, add_noise (self.lin2.weight), self.lin2.bias)
        x = x.sigmoid()
        x = torch.nn.functional.linear (x, add_noise (self.lin3.weight), self.lin3.bias)
        x = x.softmax (dim = 1)
        return x

model = NoisyModel ([784, 100, 100, 10])
print ('number of model parameters:', len (list (model.parameters())))   # two parameters (weight, bias) for each Linear

input = torch.randn (5, 28, 28)   # batch of five 28x28 images
output = model (input)            # model predicts ten classes per sample

print ('input.shape:',   input.shape, '  output.shape:', output.shape)

Here is the output of the script:

number of model parameters: 6
input.shape: torch.Size([5, 28, 28])   output.shape: torch.Size([5, 10])
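To make the point about gradients concrete, here is a minimal sketch of one training step with a noisy forward pass. The `NoisyLinear` class, the `noise_std` argument, and the random stand-in batch are illustrative assumptions, not part of the script above. Because the noise is added as a constant inside `forward()`, autograd evaluates the gradient at the noisy weights but stores it on the clean `weight`, which is what the optimizer then updates.

```python
import torch

torch.manual_seed(0)

class NoisyLinear(torch.nn.Linear):
    """Hypothetical helper: a Linear layer that perturbs its weight on every forward pass."""

    def __init__(self, in_features, out_features, noise_std=0.01):
        super().__init__(in_features, out_features)
        self.noise_std = noise_std

    def forward(self, x):
        # The noise does not require grad, so the gradient of the loss is
        # computed at the noisy weight and accumulated on the clean self.weight.
        noisy_weight = self.weight + self.noise_std * torch.randn_like(self.weight)
        return torch.nn.functional.linear(x, noisy_weight, self.bias)

layer = NoisyLinear(784, 10)
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()   # expects raw logits, so no softmax here

images = torch.randn(8, 784)              # stand-in for a flattened MNIST batch
labels = torch.randint(0, 10, (8,))

weight_before = layer.weight.detach().clone()

optimizer.zero_grad()
loss = criterion(layer(images), labels)
loss.backward()
optimizer.step()

grad_reached_clean_weight = layer.weight.grad is not None
weight_changed = not torch.equal(weight_before, layer.weight.detach())
print(grad_reached_clean_weight, weight_changed)
```

No weights are saved or restored anywhere: the clean parameters are never overwritten, so there is nothing to undo after the backward pass.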

As an aside, I do think your scheme of modifying Linear.weight, applying
it, and then restoring the original Linear.weight could be made to work, but
doing so would be complicated and contrived.
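For reference, one way the modify-and-restore scheme might look is sketched below. This is only an illustration under assumptions: the helper names `perturb_weights` and `restore_weights`, the small stand-in network, and the random batch are all hypothetical. The idea is to add the noise in place under `torch.no_grad()`, remember it, and subtract it again after `backward()` but before `optimizer.step()`.

```python
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(784, 100),
    torch.nn.Sigmoid(),
    torch.nn.Linear(100, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

def perturb_weights(model, std=0.01):
    """Hypothetical helper: add noise in place to each weight and return the noise."""
    noises = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "weight" in name:
                noise = std * torch.randn_like(param)
                param.add_(noise)
                noises[name] = noise
    return noises

def restore_weights(model, noises):
    """Hypothetical helper: subtract the stored noise to recover the clean weights."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in noises:
                param.sub_(noises[name])

images = torch.randn(8, 784)              # stand-in for a flattened MNIST batch
labels = torch.randint(0, 10, (8,))

clean = {n: p.detach().clone() for n, p in model.named_parameters()}

optimizer.zero_grad()
noises = perturb_weights(model)           # weights are now noisy
loss = criterion(model(images), labels)   # forward pass at the noisy weights
loss.backward()                           # gradients computed at the noisy weights
restore_weights(model, noises)            # undo the noise only after backward()
restored_ok = all(
    torch.allclose(clean[n], p.detach(), atol=1e-6)
    for n, p in model.named_parameters()
)
optimizer.step()                          # update is applied to the clean weights
```

The ordering matters: restoring before `backward()` would modify tensors that autograd saved during the forward pass. The restored weights match the originals only up to floating-point rounding, which is one reason this approach is less tidy than adding the noise inside `forward()`.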


K. Frank

Thanks @KFrank, for looking into it and for your suggestion.

Yes, for now, my scheme of saving the weights and then loading them back is working but it is 8-10 times slower than normal training.
Also, what I would like is to use the noisy model for the forward pass and the loss computation, and then use that loss to backpropagate and update my original model.

So my question is: can we do that, i.e., optimize a model based on another model’s output? The other model would closely track the weights of the original model, but add noise to (or otherwise transform) the weights in each forward pass.
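One possible way to do this without a second copy of the model is sketched below, assuming PyTorch 2.0 or newer, where `torch.func.functional_call` is available. The helper `noisy_parameters`, the small stand-in network, and the random batch are hypothetical. `functional_call` runs the module's forward pass with the tensors you hand it instead of its own parameters, and since each noisy tensor is computed from the original parameter, the gradients flow back to the original model.

```python
import torch
from torch.func import functional_call  # assumes PyTorch 2.0 or newer

torch.manual_seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(784, 100),
    torch.nn.Sigmoid(),
    torch.nn.Linear(100, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

def noisy_parameters(model, std=0.01):
    """Hypothetical helper: noisy copies of the weights, biases passed through unchanged."""
    return {
        name: param + std * torch.randn_like(param) if name.endswith("weight") else param
        for name, param in model.named_parameters()
    }

images = torch.randn(8, 784)              # stand-in for a flattened MNIST batch
labels = torch.randint(0, 10, (8,))

clean_before = {n: p.detach().clone() for n, p in model.named_parameters()}

optimizer.zero_grad()
# Forward pass uses the noisy tensors; the module's own parameters are not modified.
outputs = functional_call(model, noisy_parameters(model), (images,))
loss = criterion(outputs, labels)
loss.backward()

untouched_by_forward = all(
    torch.equal(clean_before[n], p.detach()) for n, p in model.named_parameters()
)
all_have_grads = all(p.grad is not None for p in model.parameters())

optimizer.step()                          # updates the original, noise-free weights
```

Any differentiable transformation of the weights could replace the additive noise in `noisy_parameters`; in that case the gradient would also pass through the transformation itself.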