Facebook BART Fine-tuning - Transformers CUDA error: CUBLAS_STATUS_NOT_INITIALIZED

I’m trying to fine-tune the Facebook BART model, following this article, in order to classify text with my own dataset.

I’m using the Trainer object to train:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=model_directory,      # output directory
    num_train_epochs=1,              # total number of training epochs (the article used 3)
    per_device_train_batch_size=4,   # batch size per device during training (the article used 16)
    per_device_eval_batch_size=16,   # batch size for evaluation (the article used 64)
    warmup_steps=50,                 # number of warmup steps for the learning rate scheduler (the article used 500)
    weight_decay=0.01,               # strength of weight decay
    logging_dir=model_logs,          # directory for storing logs
    logging_steps=10,
)

from transformers import BartForSequenceClassification

model = BartForSequenceClassification.from_pretrained("facebook/bart-base")  # the article used facebook/bart-large-mnli

trainer = Trainer(
    model=model,                          # the instantiated 🤗 Transformers model to be trained
    args=training_args,                   # training arguments, defined above
    compute_metrics=new_compute_metrics,  # a function to compute the metrics
    train_dataset=train_dataset,          # training dataset
    eval_dataset=val_dataset              # evaluation dataset
)

This is the tokenizer I used:

from transformers import BartTokenizerFast
tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-base')
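
For reference, here is a minimal sketch of how train_dataset and val_dataset might be built following the article’s pattern (the ClassificationDataset helper and the texts/labels variables below are placeholders, not my exact code):

import torch

class ClassificationDataset(torch.utils.data.Dataset):
    """Wraps tokenized encodings and integer labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# train_texts/train_labels and val_texts/val_labels are placeholders
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

train_dataset = ClassificationDataset(train_encodings, train_labels)
val_dataset = ClassificationDataset(val_encodings, val_labels)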

But when I call trainer.train(), it prints the following:

***** Running training *****
  Num examples = 172
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 11

Followed by this error:

RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 1496, in forward
    outputs = self.model(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 1222, in forward
    encoder_outputs = self.encoder(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 846, in forward
    layer_outputs = encoder_layer(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 323, in forward
    hidden_states, attn_weights, _ = self.self_attn(
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 191, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

I’ve searched this site, GitHub, and Stack Overflow, but still haven’t found anything that fixes this for me (I tried adding more memory, lowering the batch sizes and warmup steps, restarting, explicitly specifying CPU or GPU, and more, but none of it worked).

I’m running this on Databricks, on a Standard_NC24s_v3 cluster with 4 GPUs and 2 to 6 workers.

If you need any other information, comment and I’ll add it as soon as possible.

As in the other topic where you’ve linked this post, you should check whether you are running out of memory and thus cannot create a cuBLAS handle anymore.
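
A minimal sketch of how that check could look from inside the notebook, using the standard torch.cuda utilities:

import torch

# Total, reserved, and allocated memory on the first GPU, in GiB
props = torch.cuda.get_device_properties(0)
print(f"total:     {props.total_memory / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")

# Or print the full allocator breakdown
print(torch.cuda.memory_summary(device=0))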

I’ve upgraded my cluster, and only this notebook is running on it (448 GB per worker and 448 GB for the driver, 2 to 10 workers), but I still get the same error.

If I change:

num_train_epochs=3
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_steps=500,

then I get an out-of-memory error, but when we check the memory there is still free memory that this notebook is not using, which is odd.

I didn’t mean the host RAM but the GPU memory, which will be smaller than ~448GB.

Each GPU has 16 GB of memory, and that’s the maximum I can get for a GPU on Databricks right now. I tried adding os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1000", but I still get the same errors. Is there a way to process the data in chunks or something like that?
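
(For reference, as far as I know PYTORCH_CUDA_ALLOC_CONF is only read when the CUDA caching allocator initializes, so it has to be set before the first CUDA allocation; roughly this order, as a minimal sketch:)

import os

# Must be set before the first CUDA allocation, so ideally before importing torch
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1000"

import torch  # imported only after the environment variable is set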

With the above flag, the batch size changes, and everything else from before that comment, I get the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 0; 15.78 GiB total capacity; 14.59 GiB already allocated; 119.50 MiB free; 14.72 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@ptrblck

I don’t think a specific allocator config would help here, as it doesn’t seem as if your training suffers from memory fragmentation (the reserved memory, 14.72 GiB, is barely above the allocated 14.59 GiB). You would thus have to decrease the batch size or otherwise reduce the memory usage, e.g. via mixed-precision training, gradient checkpointing, etc.

I tried using a batch size of 1 for both training and evaluation and still got the same error…
@ptrblck

Do you have any other ideas? @ptrblck

It seems even a single sample runs out of memory, so you might need to reduce the memory usage even further, e.g. via mixed-precision training or gradient checkpointing.

I tried mixed-precision training by setting fp16=True → the same error
I tried gradient checkpointing by setting gradient_checkpointing=True → the same error
I also tried gradient_accumulation_steps=4 → the same error

I also tried all of the above separately and together, and I still get the same error (the combined settings are sketched below). @ptrblck
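
For completeness, this is roughly how those flags were combined in the TrainingArguments (a minimal sketch; the remaining arguments are the same as in the original snippet above):

training_args = TrainingArguments(
    output_dir=model_directory,
    num_train_epochs=1,
    per_device_train_batch_size=1,   # also tried 1 for both train and eval
    per_device_eval_batch_size=1,
    fp16=True,                       # mixed-precision training
    gradient_checkpointing=True,     # trade extra compute for lower memory
    gradient_accumulation_steps=4,   # accumulate gradients over 4 steps
    logging_dir=model_logs,
    logging_steps=10,
)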

I was able to get it working only by doing the following (see the sketch after this list):

  • Passing num_labels with the actual number of labels to from_pretrained
  • Making sure the label IDs start from 0 rather than from 1
  • Passing ignore_mismatched_sizes=True to from_pretrained
  • Enabling gradient_checkpointing=True in the TrainingArguments
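
Put together, the working setup looked roughly like this (a minimal sketch; num_labels=3 is a placeholder for the actual number of labels in my dataset):

from transformers import BartForSequenceClassification, TrainingArguments, Trainer

model = BartForSequenceClassification.from_pretrained(
    "facebook/bart-base",
    num_labels=3,                  # placeholder: use the real number of labels
    ignore_mismatched_sizes=True,  # the classification head is re-initialized
)

training_args = TrainingArguments(
    output_dir=model_directory,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    warmup_steps=50,
    weight_decay=0.01,
    gradient_checkpointing=True,   # needed to fit on the 16 GB GPUs
    logging_dir=model_logs,
    logging_steps=10,
)

# The labels in train_dataset / val_dataset must run from 0 to num_labels - 1
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=new_compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)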