Hello everyone, I hope you are having a great time.
I recently wanted to create a simple autoencoder and for that used this thread where @smth provided an example on how to create an autograd Function for the aformentioned autoencoder.
and the code he wrote is this :
However, this code fails completely on newer versions of Pytorch (e.g 1.1.0) with the error indicating the backward method needs to return as many values as the forward method received.
I asked but got no answer and I myself also couldnt specify I dont need a gradient for the second argument. I tried to set ctx.needs_input_grad but thats read-only .
What should I do here? should I simply return None, or 0 for the arguments that I’m not interested in?
Based on the discussions here I found out that I should be using None for any inputs that I dont want the gradients for. so this is done.
I however would appreciate if anyone could tell me what that last snippet do though.
its greatly appreciated.
May I ask another question if you dont mind?
I noticed we have different ways for imposing/enforcing the sparsity constraint. one other way we can achieve sparsity is likely to get the average of the layers output and treat that as the loss for sparsity and add it e.g. to the reconstruction loss and do the backward pass.
That is simply do in the forward pass :
for e in range(epochs):
for imgs,_ in dataloader_train:
imgs = imgs.to(device)
output, sparsity_loss = sae_model(imgs)
loss = criterion(output, imgs)
loss_f = loss + (sparsity_ratio * sparsity_loss)
optimizer.zero_grad()
loss_f.backward()
optimizer.step()
print(f'epoch: {e}/{epochs} loss_f: {loss_f.item():.6f} loss: { loss.item():.6f}'\
f' sparsity loss: {sparsity_loss.item():.6f} lr = {scheduler.get_lr()}')
scheduler.step()
However I noticed the outcome of these two methods are not the same exactly . why is that?
This is the weights outputs using the first method(L1Penalty) :
They use the same exact hyperparameters for optimization which is SGD with lr=0.98 for 20 epochs.
shouldn’t they have the same effect on weights? or is it because of the mean() of the activations, that the gradients will be spread more homogeneously and thus their effect is much milder than when we directly add a term to the gradients?