After running many batches, I get an OOM with an LSTM:
t....................THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1502009910772/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 353, in <module>
    run(**args.__dict__)
  File "train.py", line 271, in run
    loss.backward(rationale_selected_node)
  File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
  File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 291, in _do_backward
    result = super(NestedIOFunction, self)._do_backward(gradients, retain_variables)
  File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/function.py", line 299, in backward
    result = self.backward_extended(*nested_gradients)
  File "/mldata/conda/envs/pytorch/lib/python3.6/site-packages/torch/nn/_functions/rnn.py", line 313, in backward_extended
    self._reserve_clone = self.reserve.clone()
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1502009910772/work/torch/lib/THC/generic/THCStorage.cu:66
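For context, one common cause of an OOM that only shows up after many batches is carrying the LSTM's hidden state (or an accumulated loss) across iterations without detaching it, so each batch keeps a reference into the previous batches' autograd history. A minimal sketch of detaching the hidden state between batches; the model, sizes, and loop here are invented for illustration, not taken from train.py:

```python
import torch
import torch.nn as nn

# Toy LSTM; all sizes are arbitrary, for illustration only.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
hidden = None

for step in range(5):
    x = torch.randn(4, 10, 8)  # (batch, seq_len, features)
    out, hidden = lstm(x, hidden)
    loss = out.mean()
    loss.backward()  # frees this batch's graph
    # Detach so the next backward() cannot reach into earlier batches;
    # without this, state stays attached to the old graph every iteration.
    hidden = tuple(h.detach() for h in hidden)
```

If your hidden state is already detached each batch, this isn't the cause, but it's the first thing worth ruling out with a slow memory creep like this.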
The weird thing is, I printed the GPU memory every 3 seconds while it was running, and nothing looks odd at the moment this occurs:
| N/A 68C P0 151W / 150W | 6796MiB / 7618MiB | 89% Default |
| N/A 67C P0 137W / 150W | 6796MiB / 7618MiB | 76% Default |
| N/A 68C P0 102W / 150W | 6796MiB / 7618MiB | 74% Default |
| N/A 68C P0 158W / 150W | 6796MiB / 7618MiB | 98% Default |
| N/A 67C P0 118W / 150W | 6796MiB / 7618MiB | 97% Default |
| N/A 67C P0 107W / 150W | 6796MiB / 7618MiB | 96% Default |
| N/A 67C P0 122W / 150W | 6796MiB / 7618MiB | 96% Default |
| N/A 68C P0 115W / 150W | 6796MiB / 7618MiB | 97% Default |
| N/A 68C P0 135W / 150W | 6796MiB / 7618MiB | 96% Default |
| N/A 67C P0 139W / 150W | 6796MiB / 7618MiB | 76% Default |
| N/A 67C P0 121W / 150W | 6796MiB / 7618MiB | 98% Default |
| N/A 67C P0 141W / 150W | 6796MiB / 7618MiB | 74% Default |
| N/A 68C P0 160W / 150W | 6796MiB / 7618MiB | 96% Default |
| N/A 68C P0 101W / 150W | 6796MiB / 7618MiB | 97% Default |
| N/A 68C P0 159W / 150W | 6796MiB / 7618MiB | 81% Default |
| N/A 68C P0 140W / 150W | 6796MiB / 7618MiB | 75% Default |
| N/A 68C P0 144W / 150W | 6796MiB / 7618MiB | 75% Default |
| N/A 66C P0 80W / 150W | 6796MiB / 7618MiB | 95% Default |
| N/A 67C P0 108W / 150W | 6796MiB / 7618MiB | 75% Default |
| N/A 68C P0 131W / 150W | 6796MiB / 7618MiB | 75% Default |
| N/A 68C P0 135W / 150W | 6796MiB / 7618MiB | 76% Default |
| N/A 67C P0 102W / 150W | 6796MiB / 7618MiB | 95% Default |
| N/A 67C P0 53W / 150W | 6796MiB / 7618MiB | 98% Default |
| N/A 67C P0 137W / 150W | 6796MiB / 7618MiB | 97% Default |
| N/A 67C P0 116W / 150W | 6796MiB / 7618MiB | 96% Default |
| N/A 67C P0 130W / 150W | 6796MiB / 7618MiB | 98% Default |
| N/A 68C P0 95W / 150W | 6796MiB / 7618MiB | 97% Default |
| N/A 66C P0 161W / 150W | 6796MiB / 7618MiB | 74% Default |
| N/A 67C P0 158W / 150W | 6796MiB / 7618MiB | 95% Default |
| N/A 66C P0 104W / 150W | 6796MiB / 7618MiB | 82% Default |
| N/A 67C P0 94W / 150W | 6796MiB / 7618MiB | 97% Default |
| N/A 67C P0 150W / 150W | 6796MiB / 7618MiB | 73% Default |
| N/A 67C P0 140W / 150W | 6796MiB / 7618MiB | 75% Default |
| N/A 67C P0 100W / 150W | 6796MiB / 7618MiB | 96% Default |
| N/A 66C P0 96W / 150W | 6796MiB / 7618MiB | 96% Default |
| N/A 67C P0 122W / 150W | 6796MiB / 7618MiB | 74% Default |
| N/A 68C P0 133W / 150W | 6796MiB / 7618MiB | 97% Default |
| N/A 60C P0 42W / 150W | 0MiB / 7618MiB | 97% Default |
| N/A 58C P0 42W / 150W | 0MiB / 7618MiB | 100% Default |
| N/A 56C P0 41W / 150W | 0MiB / 7618MiB | 97% Default |
| N/A 55C P0 41W / 150W | 0MiB / 7618MiB | 99% Default |
(printed using the bash loop: while true; do nvidia-smi | grep Default; sleep 3; done)
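One caveat about that monitoring approach: nvidia-smi reports the memory PyTorch's caching allocator has reserved from the driver, which can sit flat at ~6796MiB while a short-lived allocation inside backward() (like the reserve.clone() in the traceback) fails. Depending on your PyTorch version, torch.cuda.memory_allocated() / max_memory_allocated() give tensor-level numbers instead; as a rough sketch (this API exists in 0.4+, so it may not be available on the 0.2 build in the traceback, and the snippet guards for machines without CUDA):

```python
import torch

def report_gpu_memory():
    # Tensor-level figures from PyTorch's own allocator (0.4+ API);
    # nvidia-smi only shows the larger cached/reserved total.
    if not torch.cuda.is_available():
        return "CUDA not available"
    allocated = torch.cuda.memory_allocated() / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    return "allocated: %.1f MiB, peak: %.1f MiB" % (allocated, peak)

print(report_gpu_memory())
```

Printing this inside the training loop (e.g. every N batches) would show whether allocated memory is actually creeping upward even though the nvidia-smi figure stays constant.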