Backpropagating through noise

I want to add random Gaussian noise to my network weights on every forward pass. When backpropagating, I want to calculate the gradients with respect to the distorted weights, then update the original weights using those gradients. Am I doing it right in the example below?

import torch
import torch.nn as nn
from torch.distributions import Normal

class Net(nn.Module):
	def __init__(self):
		super(Net, self).__init__()
		self.linear = nn.Linear(784, 10)
		self.gaussian = Normal(loc=0, scale=torch.ones_like(self.linear.weight))

	def forward(self, x):
		orig_weight = self.linear.weight.clone()
		noise = self.gaussian.sample()
		self.linear.weight.data = self.linear.weight.data + noise
		x = self.linear(x)
		self.linear.weight.data = orig_weight.data
		return x

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
model.train()

for epoch in range(10):
	for i in range(100):
		output = model(input)  # `input` and `label` come from the data loading, omitted here
		loss = nn.CrossEntropyLoss()(output, label)
		optimizer.zero_grad()
		loss.backward()
		optimizer.step()

I’ve debugged your code and it seems to do exactly what you wish to achieve.
I have to say I don’t really like the usage of .data in general, but this might be a valid use case. :wink:
At least I’m not sure how to make it better without manipulating the linear implementation.
Maybe someone else will have a good idea.

I appreciate your help! Initially I was surprised by your answer, because shortly after I posted my question, I realized that the (more) correct way to do it is like this:

class Net(nn.Module):
	def __init__(self):
		super(Net, self).__init__()
		self.linear = nn.Linear(784, 10)

	def forward(self, x):
		return self.linear(x)

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
model.train()

for epoch in range(10):
	for i in range(100):
		# Perturb every parameter with fresh Gaussian noise, keeping a copy of the originals.
		orig_params = []
		for p in model.parameters():
			orig_params.append(p.clone())
			gaussian = Normal(loc=0, scale=torch.ones_like(p))
			p.data = p.data + gaussian.sample()

		output = model(input)
		loss = nn.CrossEntropyLoss()(output, label)

		optimizer.zero_grad()
		loss.backward()

		# Restore the original parameters before applying the update.
		for p, orig_p in zip(model.parameters(), orig_params):
			p.data = orig_p.data

		optimizer.step()

However, now I see that the reason you didn't see any issues with my initial example is that I simplified it too much: since there is no hidden layer to backpropagate through, it makes no difference. Consider instead, for example, an MLP with a hidden layer:

class Net(nn.Module):
	def __init__(self):
		super(Net, self).__init__()
		self.linear1 = nn.Linear(784, 100)
		self.linear2 = nn.Linear(100, 10)
		self.relu = nn.ReLU()

	def forward(self, x):
		x = self.linear1(x)
		x = self.relu(x)
		x = self.linear2(x)
		return x

Well, now we would have a problem with the method from the first example: if we apply noise to both weight matrices (in linear1 and linear2), the error would backpropagate through the original linear2 weights, whereas what we really want is to backpropagate through the distorted linear2 weights, so that we get the true gradient with respect to the distorted weights of the linear1 layer. Do you agree?
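
For what it's worth, one way to avoid the .data juggling altogether would be to build the perturbed weights inside forward with torch.nn.functional.linear, so the noisy tensors stay in the autograd graph and the gradients computed at the distorted weights accumulate directly on the clean parameters. Just a sketch of the idea (the NoisyMLP name is made up, layer sizes as above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyMLP(nn.Module):
	def __init__(self):
		super(NoisyMLP, self).__init__()
		self.linear1 = nn.Linear(784, 100)
		self.linear2 = nn.Linear(100, 10)

	def forward(self, x):
		# Fresh noise on every forward pass; the noise itself carries no gradient.
		noise1 = torch.randn_like(self.linear1.weight)
		noise2 = torch.randn_like(self.linear2.weight)
		# The perturbed weights are part of the graph, so backward evaluates the gradient
		# at (weight + noise) and accumulates it on the clean weight,
		# since d(weight + noise)/d(weight) = 1.
		x = F.linear(x, self.linear1.weight + noise1, self.linear1.bias)
		x = F.relu(x)
		x = F.linear(x, self.linear2.weight + noise2, self.linear2.bias)
		return x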

By the way, I’m curious, how did you debug/verify the code?


You are right, I assumed you were adding the noise to the weights of all layers before the update step and resetting them afterwards.

Well, I just initialized the weights to a constant value, calculated the expected gradients, and had a look at what happens after adding/resetting the weights and after the update step.
It was just a numerical check to see whether the "right" gradients are used, as I'm always worried about manipulating .data. :wink:
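
Roughly along these lines (just a toy reconstruction with made-up shapes and a constant input, not the exact script I ran):

import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny linear layer with constant weights, so the expected gradient is easy to work out.
linear = nn.Linear(3, 2, bias=False)
with torch.no_grad():
	linear.weight.fill_(0.5)

optimizer = torch.optim.SGD(linear.parameters(), lr=0.1)

orig_weight = linear.weight.clone()
noise = torch.randn_like(linear.weight)

# Perturb, run the forward pass, then restore, mirroring forward() from the question.
linear.weight.data = linear.weight.data + noise
x = torch.ones(1, 3)
loss = linear(x).sum()
linear.weight.data = orig_weight.data

optimizer.zero_grad()
loss.backward()

# For this layer, dloss/dW is the outer product of the upstream gradient (ones) and the
# input (ones), i.e. a matrix of ones, so the update should be orig_weight - lr * 1.
print(torch.allclose(linear.weight.grad, torch.ones_like(linear.weight)))  # True
optimizer.step()
print(torch.allclose(linear.weight, orig_weight - 0.1))  # True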


@michaelklachko Was there some paper which motivated you to do this? Can you share it?

I know this technique under the term "variational weight noise" or "variational parameter noise", but I'm not sure where that term comes from.

Regarding the implementation, I think this could be done in a similar way to WeightNorm, right?
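
Something like the following, I imagine: a rough sketch modeled on how the WeightNorm hook re-registers the raw parameter and recomputes the weight in a forward pre-hook. The GaussianWeightNoise class and the weight_raw name are just made up for illustration:

import torch
import torch.nn as nn

class GaussianWeightNoise:
	"""Forward pre-hook that recomputes `weight` as `weight_raw + noise` on every
	forward pass, analogous to how WeightNorm recomputes it from weight_g/weight_v."""

	def __init__(self, name="weight", std=1.0):
		self.name = name
		self.std = std

	@staticmethod
	def apply(module, name="weight", std=1.0):
		fn = GaussianWeightNoise(name, std)
		weight = getattr(module, name)
		# Register the clean tensor as the real parameter and turn `weight` into a
		# plain attribute we can overwrite with a non-leaf (noisy) tensor each forward.
		del module._parameters[name]
		module.register_parameter(name + "_raw", nn.Parameter(weight.detach().clone()))
		setattr(module, name, weight.detach().clone())
		module.register_forward_pre_hook(fn)
		return fn

	def __call__(self, module, inputs):
		raw = getattr(module, self.name + "_raw")
		# The noisy weight stays in the graph, so gradients accumulate on weight_raw.
		setattr(module, self.name, raw + torch.randn_like(raw) * self.std)

layer = nn.Linear(784, 10)
GaussianWeightNoise.apply(layer, "weight", std=0.1)
out = layer(torch.randn(2, 784))
out.sum().backward()
print(layer.weight_raw.grad.shape)  # gradients land on the clean parameter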

At the time I was working on this paper: [1904.01705] Improving Noise Tolerance of Mixed-Signal Neural Networks

I actually ended up creating custom layers to do this: NoisyNet/hardware_model.py at master · michaelklachko/NoisyNet · GitHub

I’m not sure what "variational weight noise" refers to, but this paper might give you some ideas: [1506.02557] Variational Dropout and the Local Reparameterization Trick