I was trying to implement the model from the paper "Dynamic Coattention Networks for Question Answering" in PyTorch, and noticed that many of my parameters were not getting trained at all. After some debugging, the problem seems to come from an argmax operation in the decoder (page 4 of the paper). The indices output of torch.max (i.e. its second return value) has requires_grad set to False, which makes sense since argmax is not differentiable. However, the paper's authors train their model with a plain Adam optimizer. How is this possible, and what workaround would let me do the same?
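Here is a minimal repro of the behavior I'm describing. The `scores` tensor is just a stand-in for the decoder's score vector over document positions, with arbitrary shapes, not the actual DCN decoder:

```python
import torch

# Stand-in for the decoder's score vector over document positions
# (a toy example, not the actual DCN decoder).
scores = torch.randn(4, 10, requires_grad=True)

values, indices = torch.max(scores, dim=1)

print(values.requires_grad)   # True:  the max *value* is differentiable
print(indices.requires_grad)  # False: the argmax *index* is not

# Anything computed from `indices` (e.g. gathering the hidden state at the
# predicted position) is disconnected from the autograd graph, so the
# parameters that produced `scores` get no gradient through that path.
```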
The paper's authors use Chainer, which shouldn't be that different from PyTorch, right?
Also, I tried implementing this model in TensorFlow, and there it worked as written. Why is that? Does TensorFlow implement argmax differently, in some 'soft' way?
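For reference, here is the kind of minimal check I mean in TensorFlow (again a toy simplification with made-up names `x` and `y`, not the full model, using TF 2 eager mode). The gradient through tf.argmax just comes back as None rather than raising an error, which makes me wonder whether TF is simply treating the argmax as a constant instead of doing anything 'soft':

```python
import tensorflow as tf

# Simplified check, not the DCN model: does tf.argmax pass any gradient?
x = tf.Variable([[0.1, 2.0, 0.3]])

with tf.GradientTape() as tape:
    idx = tf.argmax(x, axis=1)            # integer indices, no gradient defined
    y = tf.cast(idx, tf.float32) * 2.0    # some float output downstream of argmax

print(tape.gradient(y, x))  # None: TF reports "no gradient" instead of erroring
```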