I’m trying to reproduce the results from "The State of Sparsity in Deep Neural Networks" using PyTorch. The paper applies iterative pruning to the Transformer, i.e., it prunes a certain amount of the weights every N training steps. When I apply this to the Transformer model (from fairseq) for machine translation, around 3 epochs in I am confronted with the OOM error shown below.
Does anyone have a clue as to why this might be happening?
The pseudo-code of my iterative pruning looks something like this:
for step in range(max_training_steps):
    trainer.train_step(...)
    if pruning_condition:
        trainer.prune_model(amount=...)
        torch.cuda.empty_cache()  # I added this thinking it might help, but it did not
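For context, here is a minimal, self-contained version of that loop on a toy model. Everything in it (the toy model, PRUNE_EVERY, the 0.2 amount) is illustrative, not my actual fairseq setup:

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# collect (module, parameter_name) pairs for global pruning
prunable_modules = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]

PRUNE_EVERY = 100  # stands in for the paper's pruning interval N
for step in range(1000):
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (step + 1) % PRUNE_EVERY == 0:
        # removes 20% of the weights that are still unpruned, across all layers
        prune.global_unstructured(
            prunable_modules, pruning_method=prune.RandomUnstructured, amount=0.2
        )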
- I ran the code without any pruning subroutines and it worked fine.
- At the initialization of the trainer class, I iterate over all the modules in the model and append the specific weights I want to prune to a list called self.prunable_modules. For every pruning iteration, I simply call prune.global_unstructured(self.prunable_modules, pruning_method=prune.RandomUnstructured, amount=0.2) (its behavior is sketched after this list).
- There was an issue where, when I applied pruning to the modules, the error "can't call backward twice, use retain_graph=True for your first backward call" appeared. So whenever there was a pruning iteration, I had to pass retain_graph=True to the loss.backward() call, which made that specific error disappear (see the sketch after this list).
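For what it's worth, two details of the torch.nn.utils.prune API seem relevant here (this is general PyTorch pruning behavior, not specific to fairseq). First, the initial global_unstructured call reparametrizes each pruned module: the original tensor moves to weight_orig (a Parameter), a weight_mask buffer is added, and weight is recomputed by a forward pre-hook. Second, repeated calls with amount=0.2 prune 20% of the still-surviving weights, so sparsity compounds roughly as 1 - 0.8**k after k rounds. A small sketch:

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(10, 10)
for k in range(1, 4):
    prune.global_unstructured(
        [(layer, "weight")], pruning_method=prune.RandomUnstructured, amount=0.2
    )
    sparsity = (layer.weight == 0).float().mean().item()
    print(f"round {k}: sparsity ~ {sparsity:.2f}")  # roughly 0.20, 0.36, 0.49

print(hasattr(layer, "weight_orig"), hasattr(layer, "weight_mask"))  # True True

And this is roughly what the retain_graph workaround from the last bullet looks like inside the training loop from the pseudo-code above (names hypothetical):

if pruning_condition:
    loss.backward(retain_graph=True)  # the workaround: keep the graph alive
else:
    loss.backward()
# general autograd fact: retain_graph=True leaves the graph's saved
# activations allocated after backward, so they continue to occupy GPU
# memory until the graph object itself is released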
Can you give me some suggestions? @Michela
These are the training logs right before the OOM is raised:
"wpb": "3414.7", "bsz": "106.9", "num_updates": "95950", "lr": "0.000102089", "gnorm": "2.053", "train_wall": "9", "wall": "16074"}
2020-08-05 21:20:24 | INFO | train_inner | {"epoch": 3, "update": 2.417, "sparsity": "0.508", "loss": "7.761", "nll_loss": "6.617", "ppl": "98.18", "wps": "19918.3", "ups": "5.77", "wpb": "3451.4", "bsz": "114.2", "num_updates": "96000", "lr": "0.000102062", "gnorm": "1.997", "train_wall": "9", "wall": "16082"}
2020-08-05 21:20:25 | INFO | fairseq.trainer | NOTE: Weights pruned, type: magnitude, amount: 0.017276782789559547, sparsity: 0.5162936572370858
2020-08-05 21:20:25 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 250.00 MiB (GPU 1; 15.78 GiB total capacity; 13.76 GiB already allocated; 63.19 MiB free; 14.54 GiB reserved in total by PyTorch)
Exception raised from malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:272 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2b99525931e2 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1e64b (0x2b995233464b in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1f464 (0x2b9952335464 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1faa1 (0x2b9952335aa1 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #4: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x11e (0x2b991a20f90e in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xf33949 (0x2b9918649949 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf4d777 (0x2b9918663777 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x10e9c7d (0x2b990863ec7d in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x10e9f97 (0x2b990863ef97 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xfa (0x2b9908749a1a in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2eeaa8d (0x2b990a43fa8d in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #11: <unknown function> + 0x10e9f97 (0x2b990863ef97 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #12: at::empty(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0xfa (0x2b9908749a1a in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::native::zeros(c10::ArrayRef<long>, c10::TensorOptions const&) + 0x25 (0x2b99083c10c5 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x128b2f3 (0x2b99087e02f3 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x2eb3059 (0x2b990a408059 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x10ea319 (0x2b990863f319 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::zeros(c10::ArrayRef<long>, c10::TensorOptions const&) + 0xd5 (0x2b9908734fb5 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::generated::GatherBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x209 (0x2b990a279f89 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x3375bb7 (0x2b990a8cabb7 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x2b990a8c6400 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x2b990a8c6fa1 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #22: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x2b990a8bf119 in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #23: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x2b99064b54ba in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #24: <unknown function> + 0xc70f (0x2b990734d70f in /nfs_home/sohyongs/anaconda3/envs/fairseq/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #25: <unknown function> + 0x7dd5 (0x2b9873382dd5 in /usr/lib64/libpthread.so.0)
frame #26: clone + 0x6d (0x2b9873694ead in /usr/lib64/libc.so.6)
2020-08-05 21:20:25 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Active memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| GPU reserved memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Allocations | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Active allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|===========================================================================|
2020-08-05 21:20:25 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 1 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 1 | cudaMalloc retries: 37 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 14088 MB | 14457 MB | 555185 GB | 555171 GB |
| from large pool | 13696 MB | 14065 MB | 535265 GB | 535252 GB |
| from small pool | 392 MB | 542 MB | 19919 GB | 19919 GB |
|---------------------------------------------------------------------------|
| Active memory | 14088 MB | 14457 MB | 555185 GB | 555171 GB |
| from large pool | 13696 MB | 14065 MB | 535265 GB | 535252 GB |
| from small pool | 392 MB | 542 MB | 19919 GB | 19919 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 14890 MB | 14950 MB | 92649 GB | 92634 GB |
| from large pool | 14488 MB | 14488 MB | 91291 GB | 91277 GB |
| from small pool | 402 MB | 550 MB | 1358 GB | 1357 GB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 801 MB | 2085 MB | 497497 GB | 497496 GB |
| from large pool | 791 MB | 2049 MB | 477101 GB | 477100 GB |
| from small pool | 9 MB | 95 MB | 20395 GB | 20395 GB |
|---------------------------------------------------------------------------|
| Allocations | 1369 | 1615 | 204541 K | 204540 K |
| from large pool | 378 | 407 | 91557 K | 91556 K |
| from small pool | 991 | 1374 | 112984 K | 112983 K |
|---------------------------------------------------------------------------|
| Active allocs | 1369 | 1615 | 204541 K | 204540 K |
| from large pool | 378 | 407 | 91557 K | 91556 K |
| from small pool | 991 | 1374 | 112984 K | 112983 K |
|---------------------------------------------------------------------------|
| GPU reserved segments | 280 | 351 | 1649 K | 1649 K |
| from large pool | 79 | 85 | 954 K | 954 K |
| from small pool | 201 | 275 | 695 K | 695 K |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 68 | 180 | 126454 K | 126454 K |
| from large pool | 25 | 78 | 54307 K | 54307 K |
| from small pool | 43 | 126 | 72146 K | 72146 K |
|===========================================================================|
2020-08-05 21:20:25 | WARNING | fairseq.trainer | attempting to recover from OOM in forward/backward pass