I was going through the optimizers available in PyTorch and noticed that `torch.optim.SparseAdam` doesn't accept a `weight_decay` argument. Is there a vital piece of theory I'm missing here, or is it possible to implement weight decay alongside it?
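For context, here is a minimal sketch of what I mean by implementing it manually: applying a decoupled (AdamW-style) decay step after `SparseAdam.step()`, touching only the embedding rows that actually received gradients. The model, the `weight_decay` value, and the row-selection logic are my own illustration, not part of the `SparseAdam` API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# sparse=True makes emb.weight.grad a sparse tensor, which SparseAdam expects
emb = nn.Embedding(10, 4, sparse=True)
opt = torch.optim.SparseAdam(emb.parameters(), lr=1e-2)
weight_decay = 1e-3  # illustrative value, not a SparseAdam argument

idx = torch.tensor([1, 3, 3])
loss = emb(idx).sum()
opt.zero_grad()
loss.backward()   # produces a sparse gradient on emb.weight
opt.step()

# Manual decoupled decay: shrink only the rows that had gradients,
# mirroring how SparseAdam itself updates only those rows.
with torch.no_grad():
    rows = emb.weight.grad.coalesce().indices()[0].unique()
    emb.weight[rows] *= (1.0 - opt.defaults["lr"] * weight_decay)
```

Would something like this be theoretically sound, or does weight decay on only the touched rows break the usual regularization interpretation?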