Unable to run Ray Tune + PyTorch with GPU

I am trying to run the tutorial from the PyTorch website: Hyperparameter tuning with Ray Tune — PyTorch Tutorials 2.5.0+cu124 documentation. I copied the code exactly, and it works with 2 CPUs per trial and 0 GPUs. When I change the code to use gpus_per_trial=1 and run it on my LSF node with 2 GPUs available, I get the following error:

“Trial train_cifar_2f9a7_00000 errored after 0 iterations at 2024-12-05 17:35:52. Total running time: 41s
Error file: /scratch/ray/session_2024-12-05_17-34-54_184213_2004252/artifacts/2024-12-05_17-35-10/train_cifar_2024-12-05_17-34-54/driver_artifacts/train_cifar_2f9a7_00000_0_batch_size=2,lr=0.0002_2024-12-05_17-35-10/error.txt
(train_cifar pid=2010889) Files already downloaded and verified
(train_cifar pid=2008282) ray::ImplicitFunc.train: symbol lookup error: /mxg-hpc/users/dpa13/miniforge3/envs/gnn/lib/python3.10/site-packages/torch/lib/…/…/nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: undefined symbol: _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8
2024-12-05 17:35:57,088 ERROR tune_controller.py:1331 – Trial task failed for trial train_cifar_2f9a7_00001
Traceback (most recent call last):
File “/mxg-hpc/users/dpa13/miniforge3/envs/gnn/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py”, line 110, in resolve_future
result = ray.get(future)
File “/mxg-hpc/users/dpa13/miniforge3/envs/gnn/lib/python3.10/site-packages/ray/_private/auto_init_hook.py”, line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File “/mxg-hpc/users/dpa13/miniforge3/envs/gnn/lib/python3.10/site-packages/ray/_private/client_mode_hook.py”, line 103, in wrapper
return func(*args, **kwargs)
File “/mxg-hpc/users/dpa13/miniforge3/envs/gnn/lib/python3.10/site-packages/ray/_private/worker.py”, line 2753, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File “/mxg-hpc/users/dpa13/miniforge3/envs/gnn/lib/python3.10/site-packages/ray/_private/worker.py”, line 906, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 7cb8216284f275cadcb307fa01000000
pid: 2008282
namespace: d2f581b7-73a8-4123-83ae-9fa056840b24
ip: 10.7.23.206
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.”

I have spent some time on this and have not been able to resolve it. I checked that my miniforge env has torch 2.2.2+cu121. I tried running the code on both A100 PCIE-40GB and H100 NVL (96 GB) machines. When I run “nvcc --version” from my env, it reports CUDA version 11.5. What am I missing? Any suggestions for running this tutorial successfully would be much appreciated. Thanks!
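
In case it helps with diagnosis, here is a minimal sketch of a check that could be run inside the same env to report what PyTorch itself sees, since “nvcc --version” only reports the system CUDA toolkit and not the CUDA runtime bundled with the torch wheel. The helper name collect_env_report is hypothetical, and including LD_LIBRARY_PATH rests on my assumption that a different cuDNN somewhere on that path could be involved in the symbol lookup error; I have not confirmed that.

```python
import importlib
import os


def collect_env_report():
    """Gather version details that may be relevant to the cuDNN symbol lookup error."""
    report = {
        # Assumption: libraries found via this path might be loaded instead of
        # the cuDNN that ships inside the pip/conda torch install.
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        # Ray sets this per worker; on the driver it shows what LSF exposes.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", ""),
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed in this interpreter's environment.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    # CUDA version the torch build was compiled against (None for CPU builds).
    report["torch_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    # Only query cuDNN when CUDA is usable.
    report["cudnn"] = (
        torch.backends.cudnn.version() if torch.cuda.is_available() else None
    )
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running the same function both in the driver script and inside train_cifar would show whether the Ray worker sees a different environment than the shell does.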