Why do gradients change every time?

Dear all,

I have a trained model, and I'm trying to retrieve the gradients of the output w.r.t. some new input. I set input.requires_grad = True in advance. Now I'm using autograd.grad(output, input, retain_graph=True) to get the gradients. However, with the same code, if I run it multiple times I get a slightly different result every time, and after a certain number of runs the result stops changing. Why is this happening? Am I doing something wrong? Thank you!
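
For reference, here is a minimal sketch of the kind of setup I mean (the model here is just a placeholder linear layer standing in for my trained network):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # placeholder for the trained model

input = torch.randn(1, 10)
input.requires_grad = True

output = model(input)
grads = torch.autograd.grad(output.sum(), input, retain_graph=True)[0]
print(grads)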

As far as I understand, you are not updating any weights, just computing the gradients w.r.t. the input?
Are you using BatchNorm layers? Their running statistics might be updated in the first few forward passes and then converge to the input mean and std.
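
To illustrate what I mean with a throwaway model (a single BatchNorm layer, not your actual network): a forward pass in train mode updates the running statistics even though no optimizer step is taken, and those statistics are what the eval-mode forward (and hence its gradients) depends on. model.eval() stops the updates.

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(25, 4)

for _ in range(3):
    bn(x)                   # train-mode forward updates the running stats
    print(bn.running_mean)  # drifts towards the batch mean on every call

bn.eval()                   # eval mode: running stats are used and no longer updated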

Yes, I'm just computing gradients; there is no updating, and I don't have BatchNorm layers. The pipeline is pretty straightforward: since the model was trained on minibatches of 25, I first copy the new input 25 times to create a batch matrix, and then from the output I only extract the first item in the batch.
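
Roughly like this (with a placeholder linear layer standing in for the real model):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                               # placeholder for the trained model
batch_size = 25

new_input = torch.randn(10, requires_grad=True)
batch = new_input.unsqueeze(0).repeat(batch_size, 1)   # copy the input 25 times

output = model(batch)
first = output[0]                                      # keep only the first item of the batch
grads = torch.autograd.grad(first.sum(), new_input, retain_graph=True)[0]
print(grads)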

I’m running into a similar issue with backward() - the gradients change every time I run it, and I’m sure that nothing else is changing in my code (no randomization, etc.). Furthermore, this seems to occur only when I’m using classes (object-oriented code), and not when I’m using procedural code. I’m using version 0.4.0.

Could you post an executable code snippet so that I could try to reproduce this issue?

Thank you, @ptrblck! I wish I were able to attach my code, but it looks like I can't. After much effort, I simplified my code as much as possible, but it still comes to around 70 lines. If you could still take a look and let me know what's going on, I'd really appreciate it.

Following is the summary: I have two main programs, which I'll call testgrad.py and testgradproc.py. Both contain essentially the same code; the only difference is that the first is object-oriented while the second is procedural. The main problem I'm facing is the following: both programs print the gradients 'TotalWGrad' and 'TotalBGrad' at the end. If you run them a few times, which you can do with runtestgrad.py and runtestgradproc.py, you will notice that 'TotalWGrad' can sometimes differ under testgrad but not under testgradproc. I can't explain why! My suspicion is that in the object-oriented version the variables are 'global' to the object, and it may have something to do with that… By the way, I tested this on PyTorch 0.4.1 on Windows and 0.4.0 on RHEL.

FILE 1: testgrad.py

import torch
from torch.autograd import Variable
import numpy as np
import math

class SimBEN():

    def __init__(self):  # these parameters depend on the experiment
        self.NumSimIters = [10,30,30]
        self.NumInputStates = 1
        self.num_ions = 3
        self.num_cells = 3
        self.NumTargetCells = 2
        self.NumEdges = 3
        self.Learn = True
        self.NumLearningIters = 1
        if self.Learn:
            self.NumIters = self.NumLearningIters
            self.Weights = Variable(torch.ones(self.NumEdges)*0.5,requires_grad=True)  # for the more nonlinear version of simulate
            self.Bias = Variable(torch.ones(self.num_cells)*0.5,requires_grad=True)  # for the more nonlinear version of simulate
        self.EdgeList = np.array([[0,2],[0,1],[1,2]])
        self.W = torch.zeros(self.num_cells*self.num_cells).view(self.num_cells,self.num_cells)
        self.W[self.EdgeList[:,1],self.EdgeList[:,0]] = self.Weights.clone()
        self.B = self.Bias.clone()
        self.z_array = torch.FloatTensor([1,1,-1]).view(-1,1)
        self.vm = torch.zeros(self.NumInputStates,self.num_cells)
        self.vm[:,0:self.NumTargetCells] = torch.FloatTensor([-0.08,0.08])
        self.cc_sig = torch.tensor([0.5]).repeat(self.NumInputStates,self.num_cells)
        self.cc_sig[:,0:self.NumTargetCells] = torch.FloatTensor([0.0,1.0])
        self.Dm_array = torch.zeros(self.NumInputStates,self.num_ions,self.num_cells)
        self.Dm_array[:,0,:] = 1.0e-18
        self.Dm_array[:,1,:] = 1.0e-18
        self.Dm_array[:,2,:] = 0.0e-18
        self.cc_cells = torch.zeros(self.NumInputStates,self.num_ions,self.num_cells)
        self.cc_cells[:,0,:] = 10
        self.cc_cells[:,1,:] = 125
        self.cc_cells[:,2,:] = 135
        self.vm_scale = 1.0
        self.F = 96485 # Faraday constant [C/mol]
        self.cm = 0.05  # patch capacitance of membrane [F/m2]
        self.cell_r = 5.0e-6
        self.cell_sa = (4*math.pi*self.cell_r**2) # cell surface area
        self.cell_vol = ((4/3)*math.pi*self.cell_r**3) # cell volume
        self.OutputVolts = torch.FloatTensor([-0.08,0.08])

    def Simulate_nonlinear(self,NumSimIters):    # NumSimIters is a local variable
        self.vm_time = torch.FloatTensor([])
        self.z_batch = self.z_array.clone().repeat(self.NumInputStates,1,1)  # new shape = (p,m,1)
        for t in np.arange(0,NumSimIters):
            self.vm_time = torch.cat((self.vm_time, self.vm))
            self.cc_sig = self.cc_sig + self.B*0.002 + torch.sum(self.W)*0.001
            Dc_prop = self.cc_sig.clone()
            self.Dm_array = self.Dm_array * Dc_prop
            self.cc_cells = self.cc_cells + self.Dm_array*0.001 + torch.sum(self.W)*0.001
            rho = torch.sum(self.cc_cells*self.z_batch,dim=1,keepdim=False)*self.F 
            self.vm = (1/self.cm)*rho*(self.cell_vol/self.cell_sa)
        self.vm_time = self.vm_time.view(NumSimIters,self.NumInputStates,self.num_cells) 

    def MSELoss(self,vm_list,Outputs):
        target_vm = Outputs
        target_vm = target_vm * self.vm_scale
        self.err = (vm_list - target_vm).pow(2).mean() 

    def MasterSim(self):
        self.TotalWGrad = torch.zeros(1).resize_as_(self.Weights)
        self.TotalBGrad = torch.zeros(1).resize_as_(self.Bias)
        self.Simulate_nonlinear(self.NumSimIters[0])
        self.MSELoss(self.vm_time[-10:,:,0:self.NumTargetCells],self.OutputVolts[0])
        self.err0 = self.err.clone()
        if self.Learn:
            self.err.backward(retain_graph=True)
            self.TotalWGrad = self.TotalWGrad + self.Weights.grad
            self.TotalBGrad = self.TotalBGrad + self.Bias.grad
            print(self.TotalWGrad,self.TotalBGrad)
        self.TotalError = self.err0  # + self.err1

    def MasterLearn(self):
        for itn in range(self.NumIters):
            self.MasterSim()

FILE 2: runtestgrad.py

import testgrad
from testgrad import SimBEN

for i in range(100):
	simBEN = SimBEN()
	simBEN.MasterLearn()

FILE 3: testgradproc.py

import torch
from torch.autograd import Variable
import numpy as np
import math

def runtest():
    NumSimIters = [10,30,30]
    NumInputStates = 1
    num_ions = 3
    num_cells = 3
    NumTargetCells = 2
    NumEdges = 3
    Learn = True
    NumLearningIters = 1
    if Learn:
        NumIters = NumLearningIters
        Weights = Variable(torch.ones(NumEdges)*0.5,requires_grad=True)  # for the more nonlinear version of simulate
        Bias = Variable(torch.ones(num_cells)*0.5,requires_grad=True)  # for the more nonlinear version of simulate
    EdgeList = np.array([[0,2],[0,1],[1,2]])
    W = torch.zeros(num_cells*num_cells).view(num_cells,num_cells)
    W[EdgeList[:,1],EdgeList[:,0]] = Weights.clone()
    B = Bias.clone()
    z_array = torch.FloatTensor([1,1,-1]).view(-1,1)
    vm = torch.zeros(NumInputStates,num_cells)
    vm[:,0:NumTargetCells] = torch.FloatTensor([-0.08,0.08])
    cc_sig = torch.tensor([0.5]).repeat(NumInputStates,num_cells)
    cc_sig[:,0:NumTargetCells] = torch.FloatTensor([0.0,1.0])
    Dm_array = torch.zeros(NumInputStates,num_ions,num_cells)
    Dm_array[:,0,:] = 1.0e-18
    Dm_array[:,1,:] = 1.0e-18
    Dm_array[:,2,:] = 0.0e-18
    cc_cells = torch.zeros(NumInputStates,num_ions,num_cells)
    cc_cells[:,0,:] = 10
    cc_cells[:,1,:] = 125
    cc_cells[:,2,:] = 135
    vm_scale = 1.0
    F = 96485 # Faraday constant [C/mol]
    cm = 0.05  # patch capacitance of membrane [F/m2]
    cell_r = 5.0e-6
    cell_sa = (4*math.pi*cell_r**2) # cell surface area
    cell_vol = ((4/3)*math.pi*cell_r**3) # cell volume
    OutputVolts = torch.FloatTensor([-0.08,0.08])

    def Simulate_nonlinear(NumSimIters,W,B,vm,z_array,cc_cells,cc_sig,Dm_array):    # NumSimIters is a local variable
        vm_time = torch.FloatTensor([])
        z_batch = z_array.clone().repeat(NumInputStates,1,1)  # new shape = (p,m,1)
        for t in np.arange(0,NumSimIters):
            vm_time = torch.cat((vm_time, vm))
            cc_sig = cc_sig + B*0.002 + torch.sum(W)*0.001
            Dc_prop = cc_sig.clone()
            Dm_array = Dm_array * Dc_prop
            cc_cells = cc_cells + Dm_array*0.001 + torch.sum(W)*0.001
            rho = torch.sum(cc_cells*z_batch,dim=1,keepdim=False)*F 
            vm = (1/cm)*rho*(cell_vol/cell_sa)
        vm_time = vm_time.view(NumSimIters,NumInputStates,num_cells) 
        return(vm_time)

    def MSELoss(vm_list,Outputs):
        target_vm = Outputs
        target_vm = target_vm * vm_scale
        err = (vm_list - target_vm).pow(2).mean() 
        return(err)

    for itn in range(NumIters):
        TotalWGrad = torch.zeros(1).resize_as_(Weights)
        TotalBGrad = torch.zeros(1).resize_as_(Bias)
        vm_time = Simulate_nonlinear(NumSimIters[0],W,B,vm,z_array,cc_cells,cc_sig,Dm_array)
        err = MSELoss(vm_time[-10:,:,0:NumTargetCells],OutputVolts[0])
        if Learn:
            err.backward(retain_graph=True)
            TotalWGrad = TotalWGrad + Weights.grad
            TotalBGrad = TotalBGrad + Bias.grad
            print(TotalWGrad,TotalBGrad)
        TotalError = err

FILE 4: runtestgradproc.py

from testgradproc import runtest

for i in range(100):
	runtest()

Thank you again for your time!!

UPDATE: @ptrblck, I’ve simplified the code much further. Intriguingly, now both versions show erratic gradients!

FILE 1: testgrad.py

import torch
from torch.autograd import Variable
import numpy as np

class SimBEN():

    def __init__(self):  # these parameters depend on the experiment
        self.num_cells = 3
        self.EdgeList = np.array([[0,2],[0,1],[1,2]])
        self.NumEdges = len(self.EdgeList)
        self.Weights = torch.ones(self.NumEdges,requires_grad=True)  # for the more nonlinear version of simulate
        self.W = torch.zeros(self.num_cells*self.num_cells).view(self.num_cells,self.num_cells)
        self.W[self.EdgeList[:,1],self.EdgeList[:,0]] = self.Weights
        self.OutputVolts = torch.FloatTensor([0.08])

    def Simulate(self):    
        self.cc_cells = torch.ones(self.num_cells)*10
        self.cc_cells = self.cc_cells + torch.sum(self.W)*0.001 # + self.Dm_array*0.001 
        self.vm = torch.sum(self.cc_cells).view(1)

    def MSELoss(self,vm,target_vm):
        self.err = (vm - target_vm).pow(2).mean() 

    def MasterSim(self):
        self.TotalWGrad = torch.zeros(1).resize_as_(self.Weights)
        self.Simulate()
        self.MSELoss(self.vm,self.OutputVolts)
        self.err.backward()
        self.TotalWGrad = self.TotalWGrad + self.Weights.grad
        print(self.TotalWGrad)

    def MasterLearn(self):
        self.MasterSim()

FILE 2: runtestgrad.py

import testgrad
from testgrad import SimBEN

for i in range(100):
	simBEN = SimBEN()
	simBEN.MasterLearn()

FILE 3: testgradproc.py

import torch
from torch.autograd import Variable
import numpy as np

def runtest():
    num_cells = 3
    EdgeList = np.array([[0,2],[0,1],[1,2]])
    NumEdges = len(EdgeList)
    Weights = torch.ones(NumEdges,requires_grad=True)  # for the more nonlinear version of simulate
    W = torch.zeros(num_cells*num_cells).view(num_cells,num_cells)
    W[EdgeList[:,1],EdgeList[:,0]] = Weights
    OutputVolts = torch.FloatTensor([30.08])

    def Simulate(W):    
        cc_cells = torch.ones(num_cells)*10
        cc_cells = cc_cells + torch.sum(W)*0.001
        vm = torch.sum(cc_cells).view(1)
        return(vm)

    def MSELoss(vm,target_vm):
        err = (vm - target_vm).pow(2).mean() 
        return(err)

    TotalWGrad = torch.zeros(1).resize_as_(Weights)
    vm = Simulate(W)
    err = MSELoss(vm,OutputVolts)
    err.backward()
    TotalWGrad = TotalWGrad + Weights.grad
    print(TotalWGrad)

FILE 4: runtestgradproc.py

from testgradproc import runtest

for i in range(100):
	runtest()

FINAL UPDATE: I think I found the source of the problem - it’s the .resize_as_(). So, if instead of

self.TotalWGrad = torch.zeros(1).resize_as_(self.Weights)

I have the following:

self.TotalWGrad = torch.zeros(self.NumEdges)

then the gradients are consistent.
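
A small standalone check that shows why: torch.zeros(1) only allocates a single element, and .resize_as_() grows the storage without initializing the new elements, so everything past the first slot is whatever happened to be in memory.

import torch

Weights = torch.ones(3, requires_grad=True)

bad = torch.zeros(1).resize_as_(Weights)   # only the first element is guaranteed to be 0
print(bad)                                 # remaining elements are uninitialized garbage

good = torch.zeros(Weights.shape)          # allocates and zero-fills the full tensor
print(good)                                # tensor([0., 0., 0.])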


Awesome work!
Sorry, I was too slow to find the issue.
Generally I wouldn't recommend using resize_ or resize_as_ to initialize tensors, since the new memory is uninitialized and might yield strange results. In my opinion, if you just want to initialize some tensors, you should use the factory methods (e.g. torch.zeros, torch.zeros_like, etc.).

From the docs:

If the number of elements is larger than the current storage size, then the underlying storage is resized to fit the new number of elements. If the number of elements is smaller, the underlying storage is not changed. Existing elements are preserved but any new memory is uninitialized.
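
For example, the line from the snippet above could be written with a factory method instead:

import torch

Weights = torch.ones(3, requires_grad=True)

# allocates and zero-initializes the full tensor in one step
TotalWGrad = torch.zeros_like(Weights)
print(TotalWGrad)   # tensor([0., 0., 0.])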


It seems that gradient computation is also broken in the case of AdaptiveAvgPool2d.
My gradients were different on every run, even after setting all the flags that are supposed to remove randomness:

np.random.seed(123)
torch.manual_seed(123)
torch.cuda.manual_seed_all(123)
torch.cuda.manual_seed(123)
torch.backends.cudnn.deterministic=True
torch.backends.cudnn.benchmark=False

Hope this helps anyone struggling with the same bug. :slight_smile: