Optimizing CUDA memory pipeline for RNN

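Note that you only need to mark one of them volatile, either the input or the hidden state. The flag propagates through the graph and takes precedence over `requires_grad`, so everything computed from a volatile variable is volatile as well.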
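
To make the propagation rule concrete, here is a toy model in plain Python. It is only a sketch of the flag logic, not PyTorch code: `Var` and `combine` are hypothetical names made up for illustration, and the only assumption is that a result is volatile as soon as any one of its inputs is, with volatile overriding `requires_grad`.

```python
from dataclasses import dataclass


@dataclass
class Var:
    """Hypothetical stand-in for a graph variable; only the two flags are modelled."""
    requires_grad: bool = False
    volatile: bool = False


def combine(*inputs: Var) -> Var:
    """Return the flags of the result of any operation applied to `inputs`."""
    # A single volatile input is enough to make the result volatile.
    volatile = any(v.volatile for v in inputs)
    # Volatile wins over requires_grad, so the result never requires grad.
    requires_grad = (not volatile) and any(v.requires_grad for v in inputs)
    return Var(requires_grad=requires_grad, volatile=volatile)


# One RNN step: the weights require grad, only the input is marked volatile.
weight = Var(requires_grad=True)
x = Var(volatile=True)
h0 = Var()

h1 = combine(x, h0, weight)  # first step picks up the flag from x
h2 = combine(x, h1, weight)  # next step also sees a volatile hidden state
```

In this model `h1` and `h2` both end up volatile even though the weights require grad and `h0` was never marked, which is why flagging just the input (or just the initial hidden state) is enough.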