I was training a BiLSTM-CRF model with PyTorch when I ran into a CUDA out-of-memory (OOM) error. The batch size was 16.
I tried reducing the batch size from 16 → 8 → 4 → 2 → 1, but a similar error occurs every time. The error is below. It looks like a lot of reserved memory is not actually being used. How can I fix it?
0%| | 1/872 [00:37<9:10:08, 37.90s/it]
0%| | 2/872 [00:57<6:33:41, 27.15s/it]
0%| | 2/872 [00:58<7:01:57, 29.10s/it]
Traceback (most recent call last):
  File "/home/workspace/EvaHan/src/run.py", line 108, in <module>
    main()
  File "/home/workspace/EvaHan/src/run.py", line 103, in main
    k_fold_run(config.trainset_path[0], vocab, device)
  File "/home/workspace/EvaHan/src/run.py", line 88, in k_fold_run
    test_loss, f1 = run(word_train, label_train, word_dev, label_dev, vocab, device, kf_index)
  File "/home/workspace/EvaHan/src/run.py", line 61, in run
    train(train_loader, dev_loader, vocab, model, device, kf_index)
  File "/home/workspace/EvaHan/src/train.py", line 58, in train
    out_feats = model.forward(batch_data)
  File "/home/workspace/EvaHan/src/model.py", line 35, in forward
    sequence_output, _ = self.bilstm(embeddings)
  File "/home/tools/enter/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tools/enter/envs/py38/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 691, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: CUDA out of memory. Tried to allocate 14.96 GiB (GPU 0; 31.75 GiB total capacity; 15.45 GiB already allocated; 8.05 GiB free; 22.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
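For context, here is what I have tried so far, following the error message's own hint about fragmentation. This is only a sketch of the two mitigations: setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation, and capping the sequence length fed to the BiLSTM (the single lstm call tries to allocate 14.96 GiB, which suggests very long sequences at batch size 1). The names truncate_batch and the cap of 512 are my own illustrative choices, not part of the original code.

```python
import os

# Must be set before the first CUDA tensor is allocated (i.e. before any
# .cuda() / .to(device) call), otherwise PyTorch ignores it.
# 128 is an illustrative value; the docs suggest tuning it per workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

def truncate_batch(sequences, max_len=512):
    """Cap token-sequence length so the BiLSTM activation tensor
    (batch x seq_len x 2*hidden) stays bounded.

    `max_len=512` is a hypothetical cap, not from the original post;
    the right value depends on the corpus and GPU.
    """
    return [seq[:max_len] for seq in sequences]

# Example: a 1000-token sequence gets cut down before embedding/LSTM.
long_seq = list(range(1000))
short = truncate_batch([long_seq])[0]
print(len(short))  # 512
```

Even with these in place (and batch size 1) I still hit the OOM above, so I suspect something else is holding memory, but I am not sure what.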