Autoencoder implementation with different activation function

Josiane_Rodrigues · May 6, 2019, 10:01pm

Hello community,

I need to implement an autoencoder network described in this article https://arxiv.org/pdf/1803.09065. The model is very simple. Contains only a hidden layer, which corresponds to a binarization layer of the input code. My whole difficulty is the activation function used in the hidden layer is non-differentiable and therefore the same weight matrix of the output layer is used to update the input layer. I would like to know how I can do an activation function in pytorch that does not need to be derived and that uses the output layer weight matrix to update the weights of the input layer. I will be very grateful if anyone can help me.

The network architecture follows below.

Captura%20de%20tela%20de%202019-05-06%2017-57-54

ptrblck · May 6, 2019, 10:55pm

Could you explain it a bit?
Would you like to use the same gradients from the output layer on your input layer or copy the weights into the input layer after each update?

Josiane_Rodrigues · May 6, 2019, 11:14pm

Hi @ptrblck,

I would like to copy the weights into the input layer after each update of the output layer.

ptrblck · May 6, 2019, 11:36pm

Thanks for the information!
In that case this code example might work.
I tried to come up with a similar model architecture, so I just used three linear layers and had to transpose the weight matrix since I used a “bottleneck” layer.
Also, I call x.detach() after the first linear layer to fake your behavior of the gradient loss in the input layer:

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 5)
        self.fc3 = nn.Linear(5, 10)
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = x.detach()  # Fake your use case so that fc1 cannot be updated
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

data = torch.randn(10, 10)
target = torch.randn(10, 10)
model = MyModel()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3)

for epoch in range(10):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    # print(model.fc1.weight.grad) # Will be None
    optimizer.step()
    print('Epoch {}, loss {}'.format(epoch, loss.item()))
    
    # Copy weights from output layer to input layer
    with torch.no_grad():
        model.fc1.weight.copy_(model.fc3.weight.t())

Let me know, if this helps or if I misunderstood your use case.

Josiane_Rodrigues · May 7, 2019, 12:15am

Hi @ptrblck,

Thank you. I’ll test and say if it worked

Josiane_Rodrigues · May 7, 2019, 2:12pm

Hi @ptrblck,

It worked! Thank you! I have another question =]. I need to use a heaviside (step) function from the input to the hidden layer, instead of the relu function applied here (x = F.relu (self.fc1 (x))), so that the values of this layer are binary [0,1]. I’ve seen that it has the heaviside function in numpy, but it’s conflicting with the pytorch because of the type. Any idea?

ptrblck · May 7, 2019, 2:23pm

What kind of type issue do you see?

Josiane_Rodrigues · May 7, 2019, 2:30pm

I would have to do something like this:

def forward(self, x):
x = np.heaviside(self.fc1(x),1)
x = x.detach()
x = F.tanh(self.fc2(x))
x = self.fc3(x)
return x

This error appears:

RuntimeError: Can’t call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

ptrblck · May 7, 2019, 2:43pm

You could probably indeed detach the output of fc1, since you won’t calculate gradients for it.

x = np.heaviside(self.fc1(x).detach().numpy(), 0)
x = torch.from_numpy(x)

Alternatively, you could also implement np.heaviside in PyTorch.

Josiane_Rodrigues · May 7, 2019, 2:50pm

This worked \o/. Thank you

Josiane_Rodrigues · May 8, 2019, 1:33am

Hi @ptrblck,

I would like to take one last doubt if possible.
This model uses a loss function that is the sum of two terms that follow.

Captura%20de%20tela%20de%202019-05-07%2021-24-48

In the second term, W is the weight matrix and I is an identity matrix.
For both functions I have used MSELoss to implement. I tried the following.

loss = criterion_1(output, target) + criterion_2(torch.mm(model.fc3.weight,model.fc3.weight.t()), Variable((torch.eye(10))))

It did not make any error, but I was in doubt if the code is actually doing what is described in the two functions of the model.

ptrblck · May 8, 2019, 9:34am

I’m not sure how W and the matrix norm are defined in the paper.
If they are using the Frobenius norm, you cannot use nn.MSELoss:

w = torch.matmul(model.fc3.weight.t(),model.fc3.weight)
i = torch.eye(w.size(0))

# Frobenius norm
0.5 * (w - i).norm(p=2)
((w - i)**2).sum()**0.5 * 0.5

# MSE
criterion = nn.MSELoss()
criterion(w, i)
((w - i)**2).mean()

But as I said, I’m not familiar with the paper so I might misunderstand the notation.

The first loss looks alright.

Josiane_Rodrigues · May 8, 2019, 2:31pm

Thanks for the explanation.
In the papers this notation is usually a norm L2 squared. In that case, I cannot use MSELoss, right?
Are there any other pytorch functions that I could use in this case?

ptrblck · May 8, 2019, 2:34pm

Would my first approach work? (.norm(p=2))

Josiane_Rodrigues · May 8, 2019, 2:53pm

Yes! In the paper, the authors did not specify the norm, but the L2 norm is usually used.

Josiane_Rodrigues · May 8, 2019, 3:35pm

Sorry, I feel a little confused now. How would the norm be applied to the activation function?

ptrblck · May 8, 2019, 3:47pm

I’m a bit confused, too.
Which activation function do you mean and what would you like to do with this function?

Based on the formula and code you’ve posted, it seems that l_reg is calculated based on the weight matrix of the last linear layer.

Josiane_Rodrigues · May 8, 2019, 3:52pm

Sorry. I wrote wrong, I meant loss function and no activation function. The loss function I want to implement is made up of the two functions mentioned above. Would it be this?

loss = criterion(output, target) + 0.5*(w - i).norm(p=2)

ptrblck · May 10, 2019, 12:26pm

I think this might work, but you should really check it with someone having more insight into the paper (or the general approach).

Josiane_Rodrigues · May 13, 2019, 7:50pm

Really thank you for help!!!