Confused about GPU memory usage tricks

I’m playing with the rnn-char example from here:

Here’s the simplified code I’m running:

When I run it on the GPU (Pascal Titan X), memory usage is almost 12 GB, while GPU utilization is only ~20%.
I added ‘requires_grad=False’ to the input Variable, but that didn’t help.

Any ideas why so much memory is in use?

Is this normal for an RNN of this size (a 3-layer GRU with hidden_size=500 and input_size=100)?
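
For reference, here is a rough back-of-the-envelope estimate of how much memory the GRU weights alone should need. This is only a sketch: it assumes a unidirectional GRU stored as float32 with the usual PyTorch `nn.GRU` parameter layout (three gates, with separate input and hidden bias vectors per layer). The helper name `gru_param_count` is made up for illustration, and any embedding or output layers are not counted.

```python
def gru_param_count(input_size, hidden_size, num_layers):
    """Count the weights and biases of a stacked, unidirectional GRU.

    Assumes three gates per layer and two bias vectors per gate
    (one for the input path, one for the hidden path).
    """
    total = 0
    for layer in range(num_layers):
        # Only the first layer sees the raw input; later layers see hidden states.
        in_features = input_size if layer == 0 else hidden_size
        weights = 3 * hidden_size * (in_features + hidden_size)
        biases = 2 * 3 * hidden_size
        total += weights + biases
    return total


params = gru_param_count(input_size=100, hidden_size=500, num_layers=3)
param_bytes = params * 4  # 4 bytes per float32 value (assumed dtype)
print(f"{params:,} parameters, about {param_bytes / 2**20:.1f} MiB")
```

Under these assumptions the weights come to about 3.9 million parameters, roughly 15 MiB, so the parameters themselves don't seem to explain anything close to 12 GB.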