Cuda failure 'named symbol not found' when running on 4 L4 GPUs

Hello @ptrblck,
Could you help me with the following error? The code works fine on 2 T4 GPUs but fails on 4 L4 GPUs. I am extending the Gemma 2B model for a multi-label, multi-class classification task, launching it from a Jupyter notebook with:

notebook_launcher(main, args, num_processes = 4)
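
For context, the relevant part of main() follows the standard Accelerate pattern. This is a simplified sketch (the model head, data loading, and training loop are omitted, and the argument names are placeholders), but the prepare() call is the same one shown in the traceback below:

from accelerate import Accelerator

def main(model, optimizer, scheduler, train_loader, valid_loader):
    # notebook_launcher sets up the distributed environment for each process
    accelerator = Accelerator()
    # This is the call that fails on the L4 machine: Accelerate wraps the model in
    # DistributedDataParallel here, which is where NCCL raises the CUDA error.
    model, optimizer, scheduler, train_loader, valid_loader = accelerator.prepare(
        model, optimizer, scheduler, train_loader, valid_loader
    )
    # ... training loop ...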

What I tried:
Various combinations of the settings below (a version-check snippet follows the list).

os.environ["ACCELERATE_DISTRIBUTED_TYPE"] = "MULTI_GPU"
os.environ["ACCELERATE_BACKEND"] = "gloo"
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"
os.environ["CUDA_LAUNCH_BLOCKING"] = "1" 
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

Error:

W1027 10:56:43.877000 133487515924288 torch/multiprocessing/spawn.py:146] Terminating process 1353 via signal SIGTERM
W1027 10:56:43.879000 133487515924288 torch/multiprocessing/spawn.py:146] Terminating process 1355 via signal SIGTERM
W1027 10:56:43.880000 133487515924288 torch/multiprocessing/spawn.py:146] Terminating process 1357 via signal SIGTERM
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] failed (exitcode: 1) local_rank: 0 (pid: 1351) of fn: main (start_method: fork)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 659, in _poll
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     self._pc.join(-1)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 189, in join
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     raise ProcessRaisedException(msg, error_index, failed_process.pid)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] torch.multiprocessing.spawn.ProcessRaisedException: 
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] 
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] -- Process 0 terminated with the following error:
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] Traceback (most recent call last):
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     fn(i, *args)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 583, in _wrap
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     ret = record(fn)(*args_)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     return f(*args, **kwargs)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/tmp/ipykernel_1202/2866542464.py", line 11, in main
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     model, optimizer, scheduler, train_loader, valid_loader = accelerator.prepare(
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1326, in prepare
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     result = tuple(
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     return self.prepare_model(obj, device_placement=device_placement)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1450, in prepare_model
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     model = torch.nn.parallel.DistributedDataParallel(
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     _verify_param_shape_across_processes(self.process_group, parameters)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702]     return dist._verify_params_across_processes(process_group, tensors, logger)
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] torch.distributed.DistBackendError: NCCL error in: /usr/local/src/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] ncclUnhandledCudaError: Call to CUDA function failed.
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] Last error:
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] Cuda failure 'named symbol not found'
E1027 10:56:44.033000 133487515924288 torch/distributed/elastic/multiprocessing/api.py:702] 

---------------------------------------------------------------------------
ChildFailedError                          Traceback (most recent call last)
Cell In[28], line 1
----> 1 notebook_launcher(main, args, num_processes = 4)

File /opt/conda/lib/python3.10/site-packages/accelerate/launchers.py:245, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes, rdzv_backend, rdzv_endpoint, rdzv_conf, rdzv_id, max_restarts, monitor_interval, log_line_prefix_template)
    243     if is_torch_version(">=", ELASTIC_LOG_LINE_PREFIX_TEMPLATE_PYTORCH_VERSION):
    244         launch_config_kwargs["log_line_prefix_template"] = log_line_prefix_template
--> 245     elastic_launch(config=LaunchConfig(**launch_config_kwargs), entrypoint=function)(*args)
    246 except ProcessRaisedException as e:
    247     if "Cannot re-initialize CUDA in forked subprocess" in e.args[0]:

File /opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py:133, in elastic_launch.__call__(self, *args)
    132 def __call__(self, *args):
--> 133     return launch_agent(self._config, self._entrypoint, list(args))

File /opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py:264, in launch_agent(config, entrypoint, args)
    257     events.record(agent.get_event_succeeded())
    259     if result.is_failed():
    260         # ChildFailedError is treated specially by @record
    261         # if the error files for the failed children exist
    262         # @record will copy the first error (root cause)
    263         # to the error file of the launcher process.
--> 264         raise ChildFailedError(
    265             name=entrypoint_name,
    266             failures=result.failures,
    267         )
    269     return result.return_values
    270 except ChildFailedError:

ChildFailedError: 
============================================================
main FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-27_10:56:43
  host      : efb4e6af3789
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1351)
  error_file: /tmp/torchelastic_b96wk6_7/none_vewd1snk/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
      return f(*args, **kwargs)
    File "/tmp/ipykernel_1202/2866542464.py", line 11, in main
      model, optimizer, scheduler, train_loader, valid_loader = accelerator.prepare(
    File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1326, in prepare
      result = tuple(
    File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
      self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
    File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
      return self.prepare_model(obj, device_placement=device_placement)
    File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1450, in prepare_model
      model = torch.nn.parallel.DistributedDataParallel(
    File "/opt/conda/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 822, in __init__
      _verify_param_shape_across_processes(self.process_group, parameters)
    File "/opt/conda/lib/python3.10/site-packages/torch/distributed/utils.py", line 286, in _verify_param_shape_across_processes
      return dist._verify_params_across_processes(process_group, tensors, logger)
  torch.distributed.DistBackendError: NCCL error in: /usr/local/src/pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:275, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
  ncclUnhandledCudaError: Call to CUDA function failed.
  Last error:
  Cuda failure 'named symbol not found'
  
============================================================