Sudden OOM with LSTM

After running many batches, I get an OOM with an LSTM:

    t....................THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1502009910772/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
    Traceback (most recent call last):
      File "train.py", line 353, in <module>
        run(**args.__dict__)
      File "train.py", line 271, in run
        loss.backward(rationale_selected_node)
      File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/variable.py", line 156, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
      File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 98, in backward
        variables, grad_variables, retain_graph)
      File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 291, in _do_backward
        result = super(NestedIOFunction, self)._do_backward(gradients, retain_variables)
      File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 299, in backward
        result = self.backward_extended(*nested_gradients)
      File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 313, in backward_extended
        self._reserve_clone = self.reserve.clone()
    RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1502009910772/work/torch/lib/THC/generic/THCStorage.cu:66

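For reference, this is roughly the shape of the code that hits it: an nn.LSTM forward pass, then backward() with an explicit gradient argument, repeated over many batches. This is a stripped-down sketch for illustration only, not the actual train.py (the sizes and the loss here are made up; the real code is at the repo link at the bottom):

    # Stripped-down sketch of the pattern involved (illustrative only, not the real train.py):
    # a cuDNN LSTM forward pass, then backward() with an explicit gradient tensor,
    # repeated over many batches.
    import torch
    import torch.nn as nn
    from torch.autograd import Variable

    lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2).cuda()

    for batch in range(1000):
        x = Variable(torch.randn(50, 32, 100).cuda())    # seq_len x batch x features
        out, _ = lstm(x)
        loss = out.sum(2).sum(0)                          # non-scalar loss, one value per example
        grad = Variable(torch.ones(loss.size()).cuda())   # stands in for the gradient I pass to backward
        loss.backward(grad)                               # this is the call where the OOM above is raised
        lstm.zero_grad()
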
The weird thing is, I printed the GPU memory every 3 seconds whilst it was running, and nothing odd shows up at the moment this occurs:

    | N/A   68C    P0   151W / 150W |   6796MiB /  7618MiB |     89%      Default |
    | N/A   67C    P0   137W / 150W |   6796MiB /  7618MiB |     76%      Default |
    | N/A   68C    P0   102W / 150W |   6796MiB /  7618MiB |     74%      Default |
    | N/A   68C    P0   158W / 150W |   6796MiB /  7618MiB |     98%      Default |
    | N/A   67C    P0   118W / 150W |   6796MiB /  7618MiB |     97%      Default |
    | N/A   67C    P0   107W / 150W |   6796MiB /  7618MiB |     96%      Default |
    | N/A   67C    P0   122W / 150W |   6796MiB /  7618MiB |     96%      Default |
    | N/A   68C    P0   115W / 150W |   6796MiB /  7618MiB |     97%      Default |
    | N/A   68C    P0   135W / 150W |   6796MiB /  7618MiB |     96%      Default |
    | N/A   67C    P0   139W / 150W |   6796MiB /  7618MiB |     76%      Default |
    | N/A   67C    P0   121W / 150W |   6796MiB /  7618MiB |     98%      Default |
    | N/A   67C    P0   141W / 150W |   6796MiB /  7618MiB |     74%      Default |
    | N/A   68C    P0   160W / 150W |   6796MiB /  7618MiB |     96%      Default |
    | N/A   68C    P0   101W / 150W |   6796MiB /  7618MiB |     97%      Default |
    | N/A   68C    P0   159W / 150W |   6796MiB /  7618MiB |     81%      Default |
    | N/A   68C    P0   140W / 150W |   6796MiB /  7618MiB |     75%      Default |
    | N/A   68C    P0   144W / 150W |   6796MiB /  7618MiB |     75%      Default |
    | N/A   66C    P0    80W / 150W |   6796MiB /  7618MiB |     95%      Default |
    | N/A   67C    P0   108W / 150W |   6796MiB /  7618MiB |     75%      Default |
    | N/A   68C    P0   131W / 150W |   6796MiB /  7618MiB |     75%      Default |
    | N/A   68C    P0   135W / 150W |   6796MiB /  7618MiB |     76%      Default |
    | N/A   67C    P0   102W / 150W |   6796MiB /  7618MiB |     95%      Default |
    | N/A   67C    P0    53W / 150W |   6796MiB /  7618MiB |     98%      Default |
    | N/A   67C    P0   137W / 150W |   6796MiB /  7618MiB |     97%      Default |
    | N/A   67C    P0   116W / 150W |   6796MiB /  7618MiB |     96%      Default |
    | N/A   67C    P0   130W / 150W |   6796MiB /  7618MiB |     98%      Default |
    | N/A   68C    P0    95W / 150W |   6796MiB /  7618MiB |     97%      Default |
    | N/A   66C    P0   161W / 150W |   6796MiB /  7618MiB |     74%      Default |
    | N/A   67C    P0   158W / 150W |   6796MiB /  7618MiB |     95%      Default |
    | N/A   66C    P0   104W / 150W |   6796MiB /  7618MiB |     82%      Default |
    | N/A   67C    P0    94W / 150W |   6796MiB /  7618MiB |     97%      Default |
    | N/A   67C    P0   150W / 150W |   6796MiB /  7618MiB |     73%      Default |
    | N/A   67C    P0   140W / 150W |   6796MiB /  7618MiB |     75%      Default |
    | N/A   67C    P0   100W / 150W |   6796MiB /  7618MiB |     96%      Default |
    | N/A   66C    P0    96W / 150W |   6796MiB /  7618MiB |     96%      Default |
    | N/A   67C    P0   122W / 150W |   6796MiB /  7618MiB |     74%      Default |
    | N/A   68C    P0   133W / 150W |   6796MiB /  7618MiB |     97%      Default |
    | N/A   60C    P0    42W / 150W |      0MiB /  7618MiB |     97%      Default |
    | N/A   58C    P0    42W / 150W |      0MiB /  7618MiB |    100%      Default |
    | N/A   56C    P0    41W / 150W |      0MiB /  7618MiB |     97%      Default |
    | N/A   55C    P0    41W / 150W |      0MiB /  7618MiB |     99%      Default |

(printed using the bash command `while true; do { nvidia-smi | grep Default; sleep 3; } done`)
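
In case it matters for diagnosis: nvidia-smi only reports what the process has reserved from the driver, and as I understand it PyTorch's caching allocator holds on to memory it has freed, so the number above can stay flat even while actual tensor usage moves around. A finer-grained check would be to log the allocator's own counters from inside the training loop; a minimal sketch (these functions come from newer PyTorch releases than the 0.2.0 build in the traceback, so treat this as an assumption on my part):

    # Log PyTorch's own view of GPU memory once per batch.
    # Note: torch.cuda.memory_allocated / max_memory_allocated exist in newer
    # PyTorch releases than the 0.2.0 build shown in the traceback above.
    import torch

    def log_gpu_memory(batch_idx):
        allocated = torch.cuda.memory_allocated() / 1024 ** 2     # MiB currently held by live tensors
        peak = torch.cuda.max_memory_allocated() / 1024 ** 2      # peak MiB since the process started
        print('batch %d: allocated %.1f MiB, peak %.1f MiB' % (batch_idx, allocated, peak))

    # Usage inside the training loop, e.g.:
    #     for batch_idx, batch in enumerate(train_loader):
    #         ...forward / backward / optimizer step...
    #         log_gpu_memory(batch_idx)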

Any theories on why this is happening, and/or possible solutions? I think it's weird that it runs with memory entirely unchanged for ~170 batches, and then suddenly, bam, OOM.

The code is here by the way: https://github.com/hughperkins/rationalizing-neural-predictions/tree/30edf139f2c6d89b99fb97dfaf82d1b44c5bfd57