Model on CUDA: why does .backward() still use the CPU to compute?

Hello, all.
I built a seq2seq network model with an attention mechanism, moved it to CUDA with model.cuda(), and all input tensors are CUDA tensors. When I run loss.backward(), I noticed it still seems to use the CPU heavily, as shown in the status screenshot below.
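
For reference, here is a simplified sketch of what I am doing (not my actual seq2seq model; the model and data below are just placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model and data, moved to the GPU the same way as in my real code
model = nn.Linear(10, 1).cuda()        # model parameters live on the GPU
x = torch.randn(32, 10).cuda()         # inputs are CUDA tensors
target = torch.randn(32, 1).cuda()     # targets are CUDA tensors

criterion = nn.MSELoss()
loss = criterion(model(x), target)     # forward pass runs on the GPU
loss.backward()                        # backward pass: this is where I see high CPU usage
```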

Could anyone tell me why a model on CUDA still needs the CPU to compute gradients? Thanks.
