I’ve upgraded my cluster, and this notebook is the only thing running on it (448 GB for each worker and 448 GB for the driver, autoscaling from 2 to 10 workers), yet I still get the same error.
If I change these arguments:
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_steps=500,
then I get an out-of-memory error. Yet when we check the cluster’s memory, there is still free memory that this notebook is not using, which is odd.
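For reference, here is a sketch of lower-memory settings I could try instead. The parameter names come from the Hugging Face `transformers` `TrainingArguments` API; the smaller batch sizes, `gradient_accumulation_steps`, and `fp16` are only assumptions about what might reduce memory pressure, not a confirmed fix, and `output_dir` is a hypothetical path:

```python
from transformers import TrainingArguments

# Sketch: keep the same effective train batch size (16) while holding
# fewer examples in memory at once by accumulating gradients over 4 steps.
training_args = TrainingArguments(
    output_dir="./results",          # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=4,   # reduced from 16
    per_device_eval_batch_size=16,   # reduced from 64
    gradient_accumulation_steps=4,   # 4 * 4 = effective batch of 16
    warmup_steps=500,
    fp16=True,                       # half precision to cut memory further
)
```

This does not explain why the cluster reports free memory, since the OOM may be on the GPU rather than in host RAM, but it is the usual first knob to turn.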