I'm training a simple DistilBERT binary classifier from a pretrained model. I have a simple trainer script that runs either standalone or with torchrun.
It works fine without torchrun on any PyTorch version. However, after upgrading PyTorch from 2.0.1 to 2.1, running it through torchrun breaks with the following (cryptic) error. Has anyone seen this before, or have any idea how to prevent it from happening in 2.1?
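For context, here's a minimal sketch of the code path the traceback goes through: torchrun launches the script, Accelerate hands the model to DistributedDataParallel, and DDP's parameter-shape verification over NCCL is where it dies. The `Linear` model and the function name are placeholders for illustration, not my actual trainer:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model():
    """Replicate the failing call chain: accelerator.prepare() -> DDP.__init__()
    -> _verify_param_shape_across_processes() over the process group."""
    # torchrun normally sets these; defaults let the sketch also run single-process
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # NCCL on GPU (the failing configuration); gloo fallback so the sketch runs on CPU
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)

    model = torch.nn.Linear(768, 2)  # stand-in for the DistilBERT classifier
    if backend == "nccl":
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        # On 2.1 + torchrun, this DDP construction is where the NCCL error surfaces
        ddp = DDP(model.cuda(local_rank), device_ids=[local_rank])
    else:
        ddp = DDP(model)

    dist.destroy_process_group()
    return ddp
```

Launched with e.g. `torchrun --nproc_per_node=2 repro.py` this hits the same NCCL path as my trainer; run single-process it falls back to gloo, which is consistent with the failure only appearing under torchrun.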
NCCL WARN Invalid config blocking attribute value -2147483648
Traceback (most recent call last):
File "/home/user/repos/user_skills_from_title_ensemble/model-interfaces/training_pipelines/model/stages/train_edu_job_binary_model.py", line 212, in <module>
model = train_edu_job_binary_model(training_data_input_path, model_output_path, num_epochs=num_epochs, num_steps=num_steps, learning_rate=learning_rate, resume_from_checkpoint=resume, train_percentage=train_percentage, train_test_split=train_test_split)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/repos/user_skills_from_title_ensemble/model-interfaces/training_pipelines/model/stages/train_edu_job_binary_model.py", line 187, in train_edu_job_binary_model
trainer.train()
File "/home/user/.pyenv/versions/3.11.6/lib/python3.11/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/user/.pyenv/versions/3.11.6/lib/python3.11/site-packages/transformers/trainer.py", line 1756, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.pyenv/versions/3.11.6/lib/python3.11/site-packages/accelerate/accelerator.py", line 1350, in prepare
result = tuple(
^^^^^^
File "/home/user/.pyenv/versions/3.11.6/lib/python3.11/site-packages/accelerate/accelerator.py", line 1351, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.pyenv/versions/3.11.6/lib/python3.11/site-packages/accelerate/accelerator.py", line 1226, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.pyenv/versions/3.11.6/lib/python3.11/site-packages/accelerate/accelerator.py", line 1477, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/.pyenv/versions/3.11.6/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/home/user/.pyenv/versions/3.11.6/lib/python3.11/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: …/torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648
[2024-12-03 23:00:15,738] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3273295) of binary: /home/user/.pyenv/versions/3.11.6/bin/python3.11