I’ve upgraded my cluster, and this notebook is the only thing running on it (448 GB for each worker and 448 GB for the driver, autoscaling from 2 to 10 workers), yet I still get the same error.
If I change these arguments:
num_train_epochs=3,
per_device_train_batch_size=16,
per_device_eval_batch_size=64,
warmup_steps=500,
then I get an out-of-memory error. Yet when we check the cluster’s memory, there is still free memory that this notebook is not using, which is odd.
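For reference, here is a sketch of lower-memory settings I could try instead. The parameter names come from the Hugging Face `transformers` `TrainingArguments` API; the smaller batch sizes, `gradient_accumulation_steps`, and `fp16` are only assumptions about what might reduce memory pressure, not a confirmed fix, and `output_dir` is a hypothetical path:

```python
from transformers import TrainingArguments

# Sketch: keep the same effective train batch size (16) while holding
# fewer examples in memory at once by accumulating gradients over 4 steps.
training_args = TrainingArguments(
    output_dir="./results",          # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=4,   # reduced from 16
    per_device_eval_batch_size=16,   # reduced from 64
    gradient_accumulation_steps=4,   # 4 * 4 = effective batch of 16
    warmup_steps=500,
    fp16=True,                       # half precision to cut memory further
)
```

This does not explain why the cluster reports free memory, since the OOM may be on the GPU rather than in host RAM, but it is the usual first knob to turn.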