Updating only some values in backward pass

I’m trying to implement a masked loss function that only operates on a certain region of the output. Over the course of training, the regions not covered by the loss are either blowing up or vanishing.

    def masked_l1_loss(self, x, trgt):
        # invalid pixels are marked with -1 in the target
        mask = (trgt > -1).detach()
        diff = trgt - x
        # keep only the valid pixels, then average
        diff = diff[mask]
        return diff.abs().mean()

Can you elaborate a bit more? What do you mean by ‘regions’ of the output in your question?

Yes, the regions/sections of the image in the model’s output. The loss is calculated only over the region of valid pixels in the target.
Something like in this approach, where the supervised loss is defined only for valid depth values in the target; the rest are masked out.
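To make it concrete, here’s a toy version of what I mean (made-up shapes; invalid pixels are marked with -1 in the target):

    import torch

    # Toy "depth" target: a 1x4x4 map where the right half is invalid (-1)
    trgt = torch.ones(1, 4, 4)
    trgt[:, :, 2:] = -1

    # Stand-in for the model output (random, just for illustration)
    x = torch.rand(1, 4, 4)

    # Same masking idea as masked_l1_loss above
    mask = (trgt > -1)
    loss = (trgt - x)[mask].abs().mean()
    print(mask.sum().item(), "valid pixels out of", mask.numel())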

If I understand your problem correctly, you only want to backpropagate through those pixels that are True in the mask. In that case, you cannot use the approach you defined in the question, as that would backpropagate through all the pixels.
Use this code instead:

import torch
import torch.nn as nn
import torch.optim as optim

class myModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 10)
    def forward(self, x, trgt):
        x = self.linear(x)
        
        # Mask code: zero out the outputs at the invalid positions
        mask = trgt > -1
        x[mask == 0] = 0
        return x
model = myModel()

# Dummy inputs: assume the output/mask has 10 units
x = torch.rand(1, 1)
trgt = torch.arange(-4, 6).view(1, 10).to(dtype=torch.float)

# Initial param values
print('Before update')
print(model.state_dict())

# Run one optimization step
# Large lr to visualize easily
optimizer = optim.SGD(model.parameters(), lr=100)
optimizer.zero_grad()

output = model(x, trgt)
# This step corresponds to the diff[mask] step in your question
trgt[trgt <= -1] = 0

loss = torch.mean(torch.abs(output - trgt))
loss.backward()
optimizer.step()

print('\nAfter update')
print(model.state_dict())
Output:

Before update
OrderedDict([('linear.weight', tensor([[ 0.2001],
        [ 0.8265],
        [ 0.4066],
        [ 0.8853],
        [ 0.3886],
        [-0.1819],
        [ 0.4687],
        [-0.8151],
        [ 0.6224],
        [-0.1107]])), ('linear.bias', tensor([ 0.1837,  0.0316, -0.2824,  0.9075,  0.6819,  0.6282, -0.9894, -0.2376,
         0.2664, -0.9535]))])

After update
OrderedDict([('linear.weight', tensor([[ 0.2001],
        [ 0.8265],
        [ 0.4066],
        [ 0.8853],
        [-7.4897],
        [ 7.6964],
        [ 8.3469],
        [ 7.0632],
        [ 8.5007],
        [ 7.7676]])), ('linear.bias', tensor([ 0.1837,  0.0316, -0.2824,  0.9075, -9.3181, 10.6282,  9.0106,  9.7624,
        10.2664,  9.0465]))])

As you can see, only those parameters that were part of the mask got updated.

How do I scale this up to a larger model (say, a ResNet)? Hardcoding the mask at each step isn’t an optimal approach.

By scale, are you referring to the cost of computing mask = trgt > -1? You can precompute the masks and return them from the dataloader.
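Something along these lines, just as a sketch (the dataset class and the dummy tensors here are made up; swap in your real data):

    import torch
    from torch.utils.data import Dataset, DataLoader

    class MaskedDepthDataset(Dataset):
        """Returns (input, target, mask) so the mask is computed once, not at every step."""
        def __init__(self, images, targets):
            self.images = images      # e.g. shape (N, C, H, W)
            self.targets = targets    # e.g. shape (N, H, W), with -1 marking invalid pixels

        def __len__(self):
            return len(self.images)

        def __getitem__(self, idx):
            trgt = self.targets[idx]
            mask = trgt > -1          # precomputed validity mask
            return self.images[idx], trgt, mask

    # Dummy data, just to show the shapes
    images = torch.rand(8, 1, 4, 4)
    targets = torch.ones(8, 4, 4)
    targets[:, :, 2:] = -1

    loader = DataLoader(MaskedDepthDataset(images, targets), batch_size=2)
    for img, trgt, mask in loader:
        print(img.shape, trgt.shape, mask.shape)
        break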

Nope, by scale I meant a network with more than one layer, say an auto-encoder.

You don’t have to worry about that. The concept is similar to Dropout: you zero the activations in the last layer, and as a result no gradient is carried back through those units to the earlier layers.

Say you have 2 layers, and you apply the mask to the last layer, removing 4 neurons from it. During backprop their gradients would be zero, and as a result the neurons in layer 1 would only be updated through the other (n-4) neurons.
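A rough sketch of that two-layer case (toy sizes, names made up), zeroing the masked activations as in the forward() above:

    import torch
    import torch.nn as nn

    # Toy two-layer model: 1 -> 5 -> 10; the mask "removes" the first 4 of the 10 output units
    layer1 = nn.Linear(1, 5)
    layer2 = nn.Linear(5, 10)

    x = torch.rand(1, 1)
    trgt = torch.arange(-4, 6).view(1, 10).float()
    mask = trgt > -1

    out = layer2(layer1(x))
    out = out.masked_fill(~mask, 0.0)                        # zero the masked activations
    loss = (out - trgt.masked_fill(~mask, 0.0)).abs().mean()
    loss.backward()

    # The first 4 rows of layer2's gradient are zero ...
    print(layer2.weight.grad)
    # ... while layer1 still gets gradients, routed only through the 6 unmasked units.
    print(layer1.weight.grad)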

Hope it helps.

you cannot use the approach you defined in the question, as that would backpropagate through all the pixels

Hmm, I don’t think I agree with your statement here. I modified your example code to use the OP’s approach (i.e. taking only the values of the tensor that are not masked out before the mean()) and got results similar to yours without issue. Indexing with [mask] keeps only the selected elements in the computation graph, so the masked-out pixels contribute zero gradient to everything upstream.

import torch
import torch.nn as nn
import torch.optim as optim

class myModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 10)

    def forward(self, x):
        x = self.linear(x)
        
        # you don't need to do this as long as you use the mask when losses are averaged
        # Mask code
        # mask = trgt > -1
        # x[mask==0] = 0

        return x


model = myModel()

# Dummy inputs: assume the output/mask has 10 units
x = torch.rand(1, 1)
trgt = torch.arange(-4, 6).view(1, 10).to(dtype=torch.float)
mask = trgt > -1

# Initial param values
print("Before update")
print(model.state_dict())

# Run one optimization step
# Large lr to visualize easily
optimizer = optim.SGD(model.parameters(), lr=100)
optimizer.zero_grad()

output = model(x)
# trgt[trgt <= -1] = 0

# here loss is properly masked
loss = torch.mean(torch.abs(output - trgt)[mask])
loss.backward()
optimizer.step()

print("\nAfter update")
print(model.state_dict())

And you will get similar results, where the first 4 weight and bias params are not updated:

Before update
OrderedDict([('linear.weight', tensor([[ 0.4616],
        [ 0.6589],
        [ 0.9513],
        [ 0.1962],
        [-0.5113],
        [-0.9663],
        [-0.0329],
        [-0.4200],
        [ 0.9178],
        [-0.9673]])), ('linear.bias', tensor([ 0.6523,  0.6664,  0.0597, -0.8981, -0.9052, -0.9825,  0.9581, -0.8786,
        -0.9900,  0.0388]))])

After update
OrderedDict([('linear.weight', tensor([[ 0.4616],
        [ 0.6589],
        [ 0.9513],
        [ 0.1962],
        [15.6310],
        [15.1760],
        [16.1094],
        [15.7222],
        [17.0600],
        [15.1749]])), ('linear.bias', tensor([ 0.6523,  0.6664,  0.0597, -0.8981, 15.7615, 15.6842, 17.6248, 15.7881,
        15.6767, 16.7055]))])
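If you want to check this without eyeballing the state dict, you can also look at the gradients directly right after backward() (continuing from the code above):

    # Gradients for the masked-out rows are exactly zero, so SGD leaves those params untouched
    print(model.linear.weight.grad)   # first 4 rows are zero
    print(model.linear.bias.grad)     # first 4 entries are zero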