I built a seq2seq model with an attention mechanism and moved it to CUDA with
model.cuda(); all of the input tensors are CUDA tensors as well. But when I run
loss.backward(), I find that it still uses the CPU to compute, as shown in the status below:
Could anyone tell me why a model on CUDA still needs the CPU to compute gradients? Thanks.
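For reference, here is a minimal sketch (using a stand-in linear layer, not the original seq2seq model) of how I understand the setup: every parameter and input is placed on the GPU, and after backward() one can print each tensor's device to confirm nothing in the graph silently lives on the CPU.

```python
import torch
import torch.nn as nn

# Pick the GPU if available; fall back to CPU so the snippet runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)            # stand-in for the seq2seq model
x = torch.randn(4, 10, device=device)          # inputs created directly on-device
target = torch.randn(4, 2, device=device)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Every parameter and its gradient should report the same device;
# a stray CPU tensor anywhere in the graph would show up here.
for name, p in model.named_parameters():
    print(name, p.device, p.grad.device)
```

Checking the devices this way should at least rule out an input or parameter that was accidentally left on the CPU.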