Hello @ptrblck,
Could you help me with the following error? The code works fine on 2 T4 GPUs but fails on 4 L4 GPUs. I am extending the Gemma 2B model for a multi-label, multi-class classification task, and I launch training from a Jupyter notebook with:
notebook_launcher(main, args, num_processes = 4)
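For context, here is a minimal sketch of what `main` does (simplified; `build_model`, `train_ds`, and `valid_ds` are placeholders for the actual code, not the real names). The traceback below points at the `accelerator.prepare(...)` call inside it:

```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

def main():
    accelerator = Accelerator()

    model = build_model()  # placeholder: Gemma 2B with a classification head
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
    train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)   # train_ds: placeholder dataset
    valid_loader = DataLoader(valid_ds, batch_size=8)                 # valid_ds: placeholder dataset

    # The failure happens inside this call, when Accelerate wraps the model
    # in DistributedDataParallel across the 4 L4 GPUs.
    model, optimizer, scheduler, train_loader, valid_loader = accelerator.prepare(
        model, optimizer, scheduler, train_loader, valid_loader
    )
    # ... training loop ...
```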
What I tried:
Various combinations of the environment variables below, set before calling the launcher (see the snippet after the list).
os.environ["ACCELERATE_DISTRIBUTED_TYPE"] = "MULTI_GPU"
os.environ["ACCELERATE_BACKEND"] = "gloo"
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
Error:
W1027 10:56:43.877000 133487515924288 torch/multiprocessing/spawn.py:146] Terminating process 1353 via signal SIGTERM
W1027 10:56:43.879000 133487515924288 torch/multiprocessing/spawn.py:146] Terminating process 1355 via signal SIGTERM
W1027 10:56:43.880000 133487515924288 torch/multiprocessing/spawn.py:146] Terminating process 1357 via signal SIGTERM
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] failed (exitcode: 1) local_rank: 0 (pid: 1351) of fn: main (start_method: fork)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 659, in _poll
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] self._pc.join(-1)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 189, in join
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] raise ProcessRaisedException(msg, error_index, failed_process.pid)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] torch.multiprocessing.spawn.ProcessRaisedException:
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] -- Process 0 terminated with the following error:
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] fn(i, *args)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 583, in _wrap
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] ret = record(fn)(*args_)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] return f(*args, **kwargs)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/tmp/ipykernel_1202/2866542464.py", line 11, in main
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] model, optimizer, scheduler, train_loader, valid_loader = accelerator.prepare(
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1326, in prepare
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] result = tuple(
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] return self.prepare_model(obj, device_placement=device_placement)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1450, in prepare_model
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] model = torch.nn.parallel.DistributedDataParallel(
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] _verify_param_shape_across_processes(self.process_group, parameters)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] return dist._verify_params_across_processes(process_group, tensors, logger)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] torch.distributed.DistBackendError: NCCL error in: /usr/local/src/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] ncclUnhandledCudaError: Call to CUDA function failed.
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] Last error:
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] Cuda failure 'named symbol not found'
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]
---------------------------------------------------------------------------
ChildFailedError Traceback (most recent call last)
Cell In[28], line 1
----> 1 notebook_launcher(main, args, num_processes = 4)
File /opt/conda/lib/python3.10/site-packages/accelerate/launchers.py:245, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes, rdzv_backend, rdzv_endpoint, rdzv_conf, rdzv_id, max_restarts, monitor_interval, log_line_prefix_template)
243 if is_torch_version(">=", ELASTIC_LOG_LINE_PREFIX_TEMPLATE_PYTORCH_VERSION):
244 launch_config_kwargs["log_line_prefix_template"] = log_line_prefix_template
--> 245 elastic_launch(config=LaunchConfig(**launch_config_kwargs), entrypoint=function)(*args)
246 except ProcessRaisedException as e:
247 if "Cannot re-initialize CUDA in forked subprocess" in e.args[0]:
File /opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py:133, in elastic_launch.__call__(self, *args)
132 def __call__(self, *args):
--> 133 return launch_agent(self._config, self._entrypoint, list(args))
File /opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py:264, in launch_agent(config, entrypoint, args)
257 events.record(agent.get_event_succeeded())
259 if result.is_failed():
260 # ChildFailedError is treated specially by @record
261 # if the error files for the failed children exist
262 # @record will copy the first error (root cause)
263 # to the error file of the launcher process.
--> 264 raise ChildFailedError(
265 name=entrypoint_name,
266 failures=result.failures,
267 )
269 return result.return_values
270 except ChildFailedError:
ChildFailedError:
============================================================
main FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-27_10:56:43
host : efb4e6af3789
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1351)
error_file: /tmp/torchelastic_b96wk6_7/none_vewd1snk/attempt_0/0/error.json
traceback : Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/tmp/ipykernel_1202/2866542464.py", line 11, in main
model, optimizer, scheduler, train_loader, valid_loader = accelerator.prepare(
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1326, in prepare
result = tuple(
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1450, in prepare_model
model = torch.nn.parallel.DistributedDataParallel(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /usr/local/src/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'named symbol not found'
============================================================