How to run some modules concurrently on a single GPU?

For example, there are two modules in my model:

class Net(nn.Module):
  def __init__(self):
    super().__init__()
    self.module1 = Spatial_Path()
    self.module2 = Context_Path()
  def forward(self, input_):
    output1 = self.module1(input_)
    output2 = self.module2(input_)
    return output1 + output2

Now I would like to run self.module1 and self.module2 concurrently to speed up the forward pass. How can I do that?


Given that you don’t explicitly send any tensor to the CPU in these modules, they will run asynchronously.
It then depends on how much work has to be done in each module, but the forward pass should return very quickly after queuing the jobs on the GPU. Everything should then run as fast as possible on the GPU.
Do you see very low GPU usage? How big are the ops you’re doing? Are you actually spending more time on the CPU side than the GPU side?
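If profiling shows that one module alone cannot saturate the GPU, you can also try issuing the two forward passes on separate CUDA streams. Here is a minimal sketch of that idea — note that `Spatial_Path` and `Context_Path` are not defined in the original post, so small `nn.Linear` layers stand in for them, and the code falls back to sequential execution when no GPU is present:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-ins for Spatial_Path() and Context_Path(), whose
        # definitions are not shown in the original post.
        self.module1 = nn.Linear(16, 16)
        self.module2 = nn.Linear(16, 16)

    def forward(self, input_):
        if input_.is_cuda:
            s1 = torch.cuda.Stream()
            s2 = torch.cuda.Stream()
            # Make sure input_ is fully materialized before the
            # side streams start reading from it.
            torch.cuda.synchronize()
            with torch.cuda.stream(s1):
                output1 = self.module1(input_)
            with torch.cuda.stream(s2):
                output2 = self.module2(input_)
            # Wait for both streams before combining the results.
            torch.cuda.synchronize()
        else:
            # CPU fallback: plain sequential execution.
            output1 = self.module1(input_)
            output2 = self.module2(input_)
        return output1 + output2

net = Net()
x = torch.randn(4, 16)
if torch.cuda.is_available():
    net, x = net.cuda(), x.cuda()
out = net(x)
print(out.shape)  # torch.Size([4, 16])
```

Whether this actually helps depends on kernel sizes: if each module’s kernels already fill the GPU, the streams just serialize and you gain nothing, and the extra synchronization can even hurt. Profile first (e.g. with `torch.profiler` or `nvidia-smi`) before adding this complexity.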