Greedy optimisation with random noise in gradients

I am trying to write code for a simple objective: I take the usual PyTorch gradients, make a copy of them, and add some noise to the copy. For each batch, I check the loss for the original gradients and the loss for the noisy gradients, and I pick whichever set gives the lower loss. While I alter the gradients, I do not want to alter the momentum buffers the optimiser maintains via optimiser.step(). Can you help me with the following two questions?

  1. How do I reference all gradients at once (instead of by layer name, like model.conv1.grad), perhaps with a list comprehension, and assign values to them?

  2. What mistake am I making in my code? Do I need to zero the gradients somehow, e.g. with grad.zero_()? If so, I think I need requires_grad = True. Is a deep copy of the gradients the solution? And if I deep-copy the model, how do I assign the original optimiser state to the new model?

P.S.: For reference, I have also included the relevant function:

def train_epoch(eta, model, train_loader, criterion):
    model.train()

    running_loss = 0.0
    predictions = []
    ground_truth = []
    loss_den = 1
    
    start_time = time.time()
    optimiser = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    
    for batch_idx, (data, target) in enumerate(train_loader):
        
        data = data.to(device)
        target = target.to(device)
    
        #previous model
        outputs = model(data.float())
        _, predicted = torch.max(outputs.data, 1)
        total_predictions = target.size(0)
        correct_predictions = (predicted == target).sum().item()
        acc = (correct_predictions/total_predictions)*100.0
        
        loss = criterion(outputs, target)
        loss.backward()
        optimiser.step()
        
        # conv1grad, conv2grad and conv3grad hold the old (original) gradients
        conv1grad = model.conv1.weight.grad
        conv2grad = model.conv2.weight.grad
        conv3grad = model.conv3.weight.grad
                
        noisyGrad1 = eta * np.abs(conv1grad.detach().cpu().numpy())
        noisyGrad2 = eta * np.abs(conv2grad.detach().cpu().numpy())
        noisyGrad3 = eta * np.abs(conv3grad.detach().cpu().numpy())
        
        newGrad1 = conv1grad + torch.from_numpy(np.random.uniform(-noisyGrad1, noisyGrad1))
        newGrad2 = conv2grad + torch.from_numpy(np.random.uniform(-noisyGrad2, noisyGrad2))
        newGrad3 = conv3grad + torch.from_numpy(np.random.uniform(-noisyGrad3, noisyGrad3))
        
        model.conv1.weight.grad = nn.Parameter(torch.from_numpy(newGrad1.detach().numpy()).float())
        model.conv2.weight.grad = nn.Parameter(torch.from_numpy(newGrad2.detach().numpy()).float())
        model.conv3.weight.grad = nn.Parameter(torch.from_numpy(newGrad3.detach().numpy()).float())
        
        #The new loss value for the new gradients is computed
        outputs = model(data.float())
        _, predicted = torch.max(outputs.data, 1)
        total_predictions = target.size(0)
        correct_predictions = (predicted == target).sum().item()
        acc_new = (correct_predictions/total_predictions)*100.0
        
        loss_new = criterion(outputs, target)
        loss_den += 1

        # collecting predictions for the confusion matrix
        predictions += list(predicted.detach().cpu().numpy())
        ground_truth += list(target.detach().cpu().numpy())

        if loss_new.item() > loss.item():
            model.conv1.weight.grad = conv1grad
            model.conv2.weight.grad = conv2grad
            model.conv3.weight.grad = conv3grad

            running_loss += loss.item()
        else:
            running_loss += loss_new.item()
        
    end_time = time.time()

    running_loss /= loss_den
    
    print('Training Loss: ', running_loss, 'Time: ',end_time - start_time, 's')
    
    return running_loss, model

I read my own question again (it was typed while I was busy with other work; sorry for that). It is somewhat difficult to comprehend on a first reading, so let me ask my questions one by one. First question:

How do I access all the weights of a model? Currently I am accessing them by layer name, like model.conv1.weight.grad. But is there anything like model.weight.grad to access all gradients at once (both getter and setter)?

To all PyTorch developers, thanks for the beautiful library, and to those helping here, thanks for your time.

Hi,

You will still have to access the gradients for each weight one by one, but you can do that automatically with:

for param in model.parameters():
    param.grad  # do something with it here
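For example, a minimal getter/setter sketch over all parameter gradients could look like this, assuming a backward pass has already populated the .grad fields (new_grads is a made-up list of replacement tensors with matching shapes, dtypes and devices):

# getter: detached copies of every current gradient
old_grads = [p.grad.detach().clone() for p in model.parameters()]

# setter: write replacement gradients back into the .grad fields
for p, new_g in zip(model.parameters(), new_grads):
    p.grad = new_g.clone()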

Thank you so much for this; it answers my first question perfectly. Let me consolidate my code and phrase the rest of my question better.

My next and last question is how I can undo loss.backward() for a given batch. I want to compute the backprop gradients with autograd, but I do not want those gradients to be written into the model's .grad fields. How can I do that?

The final objective is to obtain two sets of gradients for each batch (the regular one and a noisy one) and apply whichever set produces the lower loss. How can I achieve that?

The cleaner code (thanks to @albanD) is as follows:

def train_epoch(eta, model, train_loader, criterion):
    model.train()
    running_loss = 0.0
    predictions = []
    ground_truth = []
    loss_den = 1
    
    start_time = time.time()
    optimiser = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    
    for batch_idx, (data, target) in enumerate(train_loader):
        
        data = data.to(device)
        target = target.to(device)
    
        #previous model
        outputs = model(data.float())        
        loss = criterion(outputs, target)
        loss.backward()
        optimiser.step()
        
        # oldGrad keeps the original gradients so they can be restored later
        oldGrad = list()
        for param in model.parameters():
            convGrad = param.grad
            oldGrad.append(convGrad)
            noisyGrad = eta * np.abs(convGrad.detach().cpu().numpy())
            newGrad = convGrad + torch.from_numpy(np.random.uniform(-noisyGrad, noisyGrad))
            param.grad = nn.Parameter(torch.from_numpy(newGrad.detach().numpy()).float())
        
        #The new loss value for the new gradients is computed
        outputs = model(data.float())
        loss_new = criterion(outputs, target)
        loss_den += 1

        if loss_new.item() > loss.item():
            for paramIdx, param in enumerate(model.parameters()):
                param.grad = oldGrad[paramIdx]

            running_loss += loss.item()
        else:
            running_loss += loss_new.item()
        
    end_time = time.time()
    running_loss /= loss_den
    print('Training Loss: ', running_loss, 'Time: ',end_time - start_time, 's')
    
    return running_loss, model

To summarise my previous question:

  1. How do I compute the backprop gradients for a given batch, but not apply those gradients to the model automatically?

  2. How can I do optimiser.step() with gradients that I apply to the model manually?

Hi,

  1. You can use autograd.grad() to get the gradients as a list without modifying the .grad fields of the parameters.
  2. You will need to update the .grad fields before calling the optimizer.step() function, I'm afraid.

Also, I am fairly confused by this line: param.grad = nn.Parameter(torch.from_numpy(newGrad.detach().numpy()).float()). Why do you wrap the grad in a Parameter? And why do you send it to numpy?


Thank you so much for your time and helpful reply.

Also, I am fairly confused by this line: param.grad = nn.Parameter(torch.from_numpy(newGrad.detach().numpy()).float()). Why do you wrap the grad in a Parameter? And why do you send it to numpy?

I tried to read up on this before replying. I was going through 11-785 Deep Learning, and I think PyTorch works similarly. I am still confused by autograd (even after a week) when it comes to applying gradients manually myself.

From what I understand, Parameter wraps a numpy array as a tensor. Because I want to select the gradients manually and apply them at each iteration, my gradients live in numpy arrays (I am more familiar with numpy than with PyTorch).

You will need to update the .grad fields before calling the optimizer.step() function, I'm afraid.

OK. But how do I invoke autograd.grad() for a model (without applying the gradients to the model)? I was looking at the Automatic differentiation package - torch.autograd page in the PyTorch 2.1 documentation. So, do I call model.grad()?

That is not true.
To get a Tensor from a numpy array, you can do t = torch.from_numpy(your_numpy_array).
An nn.Parameter is a thin wrapper around Tensor that has special meaning in the torch.nn library. Namely, it is a parameter of a Module and so will be returned by the mod.parameters() call.
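For example, a tiny sketch of the difference (the toy module and names are made up, just for illustration):

import numpy as np
import torch
import torch.nn as nn

arr = np.zeros((2, 2), dtype=np.float32)
t = torch.from_numpy(arr)             # a plain Tensor that shares memory with arr

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(2, 2))  # registered as a parameter

toy = Toy()
print(list(toy.parameters()))         # contains toy.w; a plain Tensor like t never would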

But, how do I invoke autograd.grad()

grads = autograd.grad(loss, model.parameters()) :)
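For instance, a rough sketch of feeding those gradients back to the optimiser (reusing the loss, model and optimiser names from your code above):

from torch import autograd

# gradients come back as a tuple; the parameters' .grad fields are untouched
grads = autograd.grad(loss, model.parameters())

# copy the chosen gradients into .grad, then take the optimiser step
for p, g in zip(model.parameters(), grads):
    p.grad = g.detach().clone()
optimiser.step()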


Namely, it is a parameter of a Module and so will be returned by the mod.parameters() call.

Thank you! Let me format the code and ask.

This is so helpful! Thank you, @albanD.
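Putting the advice in this thread together, here is one possible shape for the per-batch logic. It is only a rough, untested sketch: the train_batch helper and its lr argument are made up for the example, and the "loss under a gradient set" is approximated by the loss after a plain SGD-style trial update, which is not exactly what the momentum optimiser will do on the real step:

import torch

def train_batch(eta, lr, model, data, target, criterion, optimiser):
    # regular gradients, computed without touching the .grad fields
    outputs = model(data.float())
    loss = criterion(outputs, target)
    grads = torch.autograd.grad(loss, model.parameters())

    # noisy copies: uniform noise scaled by eta * |g|, in pure PyTorch
    noisy_grads = [g + torch.empty_like(g).uniform_(-1.0, 1.0) * eta * g.abs()
                   for g in grads]

    def trial_loss(candidate):
        # temporarily apply a plain SGD-style update, measure the loss,
        # then restore the original weights; momentum is never touched
        with torch.no_grad():
            backup = [p.detach().clone() for p in model.parameters()]
            for p, g in zip(model.parameters(), candidate):
                p -= lr * g
            value = criterion(model(data.float()), target).item()
            for p, b in zip(model.parameters(), backup):
                p.copy_(b)
        return value

    loss_regular = trial_loss(grads)
    loss_noisy = trial_loss(noisy_grads)
    chosen, best = ((grads, loss_regular) if loss_regular <= loss_noisy
                    else (noisy_grads, loss_noisy))

    # hand the winning gradients to the real optimiser; momentum is updated once
    optimiser.zero_grad()
    for p, g in zip(model.parameters(), chosen):
        p.grad = g.detach().clone()
    optimiser.step()
    return best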