Sorry for the late reply! I profiled it with the bottleneck module, and I still see that most of the running time is spent loading data, while the time spent in backward is quite small:
------------------ --------------- --------------- --------------- --------------- ---------------
Name                      CPU time       CUDA time           Calls       CPU total      CUDA total
------------------ --------------- --------------- --------------- --------------- ---------------
stack                 878842.491us   1013487.305us               1    878842.491us   1013487.305us
stack                 868481.575us    998066.406us               1    868481.575us    998066.406us
stack                 861072.974us   1006662.109us               1    861072.974us   1006662.109us
stack                 860799.906us    995949.219us               1    860799.906us    995949.219us
stack                 216507.028us    249775.391us               1    216507.028us    249775.391us
stack                 213380.171us    247549.805us               1    213380.171us    247549.805us
ExpandBackward         20528.386us       114.746us               1     20528.386us       114.746us
sum                    20522.516us       110.840us               1     20522.516us       110.840us
_sum                   20509.686us       102.051us               1     20509.686us       102.051us
ExpandBackward         20494.596us        50.781us               1     20494.596us        50.781us
sum                    20489.766us        46.875us               1     20489.766us        46.875us
_sum                   20479.066us        41.016us               1     20479.066us        41.016us
mean                   11524.652us        51.270us               1     11524.652us        51.270us
mean                   11162.832us        66.406us               1     11162.832us        66.406us
mean                    9420.275us        62.500us               1      9420.275us        62.500us
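
To make the comparison easier to read, here is a minimal sketch that sums the CPU total column per op name for the rows listed above. The helper name `cpu_total_by_op` is made up for illustration, and the numbers are simply copied from the table. Some of the smaller rows (for example `sum` and `_sum`) may be nested calls that overlap in time, so the resulting shares are only a rough guide.

```python
from collections import defaultdict

# (op name, CPU total in microseconds), copied from the profiler rows above.
ROWS = [
    ("stack", 878842.491),
    ("stack", 868481.575),
    ("stack", 861072.974),
    ("stack", 860799.906),
    ("stack", 216507.028),
    ("stack", 213380.171),
    ("ExpandBackward", 20528.386),
    ("sum", 20522.516),
    ("_sum", 20509.686),
    ("ExpandBackward", 20494.596),
    ("sum", 20489.766),
    ("_sum", 20479.066),
    ("mean", 11524.652),
    ("mean", 11162.832),
    ("mean", 9420.275),
]


def cpu_total_by_op(rows):
    """Sum the CPU total (in microseconds) of all rows sharing an op name."""
    totals = defaultdict(float)
    for name, cpu_us in rows:
        totals[name] += cpu_us
    return dict(totals)


totals = cpu_total_by_op(ROWS)
# Naive grand total over the listed rows; overlapping (nested) ops are counted twice.
grand_total = sum(totals.values())

# Print ops from most to least expensive, in milliseconds and as a share of the listed rows.
for name, cpu_us in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16}{cpu_us / 1e3:>12.1f} ms{cpu_us / grand_total:>8.1%}")
```

With these rows, the six `stack` calls account for well over 90% of the listed CPU time, and the backward-related rows (`ExpandBackward`, `sum`, `_sum`) together make up only a few percent.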