I built a seq2seq model with an attention mechanism and moved it to CUDA with
model.cuda(); all of the input tensors are CUDA tensors as well. But when I run
loss.backward(), I find that it still uses the CPU to compute, as shown in the status below:
Could anyone tell me why a model on CUDA still needs the CPU to compute gradients? Thanks.
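For reference, here is a minimal sketch (using a stand-in linear layer, not the original seq2seq model) of how I understand the setup: every parameter and input is placed on the GPU, and after backward() one can print each tensor's device to confirm nothing in the graph silently lives on the CPU.

```python
import torch
import torch.nn as nn

# Pick the GPU if available; fall back to CPU so the snippet runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)            # stand-in for the seq2seq model
x = torch.randn(4, 10, device=device)          # inputs created directly on-device
target = torch.randn(4, 2, device=device)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Every parameter and its gradient should report the same device;
# a stray CPU tensor anywhere in the graph would show up here.
for name, p in model.named_parameters():
    print(name, p.device, p.grad.device)
```

Checking the devices this way should at least rule out an input or parameter that was accidentally left on the CPU.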