I am trying to fine-tune the BART model from transformers for language generation on a custom dataset (30K examples of length 256, less than 5 MB on disk).
I have followed the Data Parallelism guide. Here are the relevant parts of my code:
```python
args.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

if args.n_gpu > 1:
    model = nn.DataParallel(model)
model.to(args.device)

# Training
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

for step, batch in enumerate(epoch_iterator):
    model.train()
    batch = tuple(t.to(args.device) for t in batch)
```
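The rest of the training step follows the standard transformers fine-tuning loop. Roughly, it looks like this (paraphrasing from memory, so the exact forward arguments may differ slightly):

```python
inputs, labels = batch[0], batch[1]          # input ids and target ids from my dataset
outputs = model(input_ids=inputs, labels=labels)
loss = outputs[0]

if args.n_gpu > 1:
    loss = loss.mean()                       # DataParallel returns one loss per GPU, so average them

loss.backward()
optimizer.step()
scheduler.step()
model.zero_grad()
```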
I am hitting a CUDA out-of-memory error with a per-GPU batch size of 4 on 2 GPUs, although training works fine on a single GPU. The GPUs are two Titan X cards with 12 GB of memory each.
This is the error message:

```
RuntimeError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 11.93 GiB total capacity; 10.84 GiB already allocated; 289.81 MiB free; 277.07 MiB cached) (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:267)
```
If it helps, I am using the AdamW optimizer with a linear warmup schedule.
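The optimizer and scheduler are set up along these lines (the learning rate and warmup values below are placeholders, not my exact settings):

```python
from transformers import AdamW, get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=3e-5, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,        # placeholder value
    num_training_steps=t_total,  # total number of training steps
)
```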
I have also tried setting CUDA_VISIBLE_DEVICES explicitly, but I get the same error; roughly like this (the exact device ids may vary):
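```python
import os

# Restrict PyTorch to the two Titan X cards before any CUDA calls are made
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```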
Am I missing something?