How to load all data into GPU for training

Sorry for the late reply! I profiled it with the bottleneck module, and most of the running time is still spent loading data, while the time spent in the backward pass is quite small:

------------------  ---------------  ---------------  ---------------  ---------------  ---------------
Name                       CPU time        CUDA time            Calls        CPU total       CUDA total
------------------  ---------------  ---------------  ---------------  ---------------  ---------------
stack                  878842.491us    1013487.305us                1     878842.491us    1013487.305us
stack                  868481.575us     998066.406us                1     868481.575us     998066.406us
stack                  861072.974us    1006662.109us                1     861072.974us    1006662.109us
stack                  860799.906us     995949.219us                1     860799.906us     995949.219us
stack                  216507.028us     249775.391us                1     216507.028us     249775.391us
stack                  213380.171us     247549.805us                1     213380.171us     247549.805us
ExpandBackward          20528.386us        114.746us                1      20528.386us        114.746us
sum                     20522.516us        110.840us                1      20522.516us        110.840us
_sum                    20509.686us        102.051us                1      20509.686us        102.051us
ExpandBackward          20494.596us         50.781us                1      20494.596us         50.781us
sum                     20489.766us         46.875us                1      20489.766us         46.875us
_sum                    20479.066us         41.016us                1      20479.066us         41.016us
mean                    11524.652us         51.270us                1      11524.652us         51.270us
mean                    11162.832us         66.406us                1      11162.832us         66.406us
mean                     9420.275us         62.500us                1       9420.275us         62.500us
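
To put a number on that split, here is a rough sketch that sums the CPU total column per op, using only the rows pasted above. It assumes the `stack` calls come from batching samples in the data-loading path, which is my reading of the table and not something the profiler states. Nested ops such as `sum` and `_sum` are probably counted twice, so treat the shares as approximate.

```python
from collections import defaultdict

# (op name, CPU total in microseconds), copied from the profiler rows above.
ROWS = [
    ("stack", 878842.491),
    ("stack", 868481.575),
    ("stack", 861072.974),
    ("stack", 860799.906),
    ("stack", 216507.028),
    ("stack", 213380.171),
    ("ExpandBackward", 20528.386),
    ("sum", 20522.516),
    ("_sum", 20509.686),
    ("ExpandBackward", 20494.596),
    ("sum", 20489.766),
    ("_sum", 20479.066),
    ("mean", 11524.652),
    ("mean", 11162.832),
    ("mean", 9420.275),
]


def cpu_time_by_op(rows):
    """Sum CPU time (in microseconds) for each op name."""
    totals = defaultdict(float)
    for name, cpu_us in rows:
        totals[name] += cpu_us
    return dict(totals)


totals = cpu_time_by_op(ROWS)
grand_total = sum(totals.values())

# Print ops from most to least expensive, with their share of the listed time.
for name, cpu_us in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16}{cpu_us / 1e3:>12.1f} ms{cpu_us / grand_total:>8.1%}")
```

On these rows, `stack` alone accounts for roughly 3.9 s of CPU time, well over 90% of everything listed, while the backward-related ops add up to a small fraction of that.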