How to optimize code when switching from CPU to GPU?


I am trying to improve the training performance of a Transformer-type neural network (not very large) using a single T4 GPU (AWS g4dn.xlarge instance).

I’ve changed my code, replacing all the NumPy functions with torch functions, following the advice from 7 Tips To Maximize PyTorch Performance,
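For context, the kind of replacement I made looks roughly like this (a minimal sketch, not my actual model code; the shapes and the matmul are just illustrative):

```python
import numpy as np
import torch

# Fall back to CPU so the sketch runs anywhere; on g4dn.xlarge this picks the T4
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Before: NumPy on the CPU
a_np = np.random.rand(512, 512).astype(np.float32)
b_np = np.random.rand(512, 512).astype(np.float32)
c_np = np.matmul(a_np, b_np)

# After: torch tensors created directly on the target device
a = torch.rand(512, 512, device=device)
b = torch.rand(512, 512, device=device)
c = a @ b  # executes on the GPU when CUDA is available
```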

but I see only about a 17% improvement in performance (over 1000 training epochs).

Running the profiler shows:

- 36% of the time is spent in method 'run_backward' of 'torch._C._EngineBase'
- 17% in built-in method tensor
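The numbers above came from a cProfile-style run over the training loop; a minimal sketch of that setup (with a hypothetical placeholder standing in for my actual training step) looks like:

```python
import cProfile
import io
import pstats

def train_one_epoch():
    # Hypothetical stand-in for the real training step; with actual PyTorch
    # code, entries like "method 'run_backward' of 'torch._C._EngineBase'"
    # appear in the stats below.
    total = 0
    for i in range(1000):
        total += i * i
    return total

pr = cProfile.Profile()
pr.enable()
train_one_epoch()
pr.disable()

s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())
```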

Any suggestions on code optimization for the GPU?
Is there even value in using a GPU if the NN is not huge?

If there is no way to make the GPU significantly faster, wouldn’t it be cheaper to get more CPUs and train in parallel on them?