I was training a BiLSTM-CRF model with PyTorch when I ran into a CUDA out-of-memory (OOM) error. The batch size was 16.
I tried reducing the batch size from 16 → 8 → 4 → 2 → 1, but a similar error occurs every time. The error is below. It looks like a lot of reserved memory is not actually being used. How can I fix it?
0%| | 1/872 [00:37<9:10:08, 37.90s/it]
0%| | 2/872 [00:57<6:33:41, 27.15s/it]
0%| | 2/872 [00:58<7:01:57, 29.10s/it]
Traceback (most recent call last):
  File "/home/workspace/EvaHan/src/run.py", line 108, in <module>
    main()
  File "/home/workspace/EvaHan/src/run.py", line 103, in main
    k_fold_run(config.trainset_path[0], vocab, device)
  File "/home/workspace/EvaHan/src/run.py", line 88, in k_fold_run
    test_loss, f1 = run(word_train, label_train, word_dev, label_dev, vocab, device, kf_index)
  File "/home/workspace/EvaHan/src/run.py", line 61, in run
    train(train_loader, dev_loader, vocab, model, device, kf_index)
  File "/home/workspace/EvaHan/src/train.py", line 58, in train
    out_feats = model.forward(batch_data)
  File "/home/workspace/EvaHan/src/model.py", line 35, in forward
    sequence_output, _ = self.bilstm(embeddings)
  File "/home/tools/enter/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/tools/enter/envs/py38/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 691, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: CUDA out of memory. Tried to allocate 14.96 GiB (GPU 0; 31.75 GiB total capacity; 15.45 GiB already allocated; 8.05 GiB free; 22.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
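For context, here is what I have tried so far, following the error message's own hint about fragmentation. This is only a sketch of the two mitigations: setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation, and capping the sequence length fed to the BiLSTM (the single lstm call tries to allocate 14.96 GiB, which suggests very long sequences at batch size 1). The names truncate_batch and the cap of 512 are my own illustrative choices, not part of the original code.

```python
import os

# Must be set before the first CUDA tensor is allocated (i.e. before any
# .cuda() / .to(device) call), otherwise PyTorch ignores it.
# 128 is an illustrative value; the docs suggest tuning it per workload.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

def truncate_batch(sequences, max_len=512):
    """Cap token-sequence length so the BiLSTM activation tensor
    (batch x seq_len x 2*hidden) stays bounded.

    `max_len=512` is a hypothetical cap, not from the original post;
    the right value depends on the corpus and GPU.
    """
    return [seq[:max_len] for seq in sequences]

# Example: a 1000-token sequence gets cut down before embedding/LSTM.
long_seq = list(range(1000))
short = truncate_batch([long_seq])[0]
print(len(short))  # 512
```

Even with these in place (and batch size 1) I still hit the OOM above, so I suspect something else is holding memory, but I am not sure what.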