CUDA communication failure

/home/jarvis/yes/envs/sol-fl/lib/python3.11/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) return torch._C._cuda_getDeviceCount() > 0

I am getting this error when I train my classifier head for more epochs and after I increased the batch size from 8 to 16 or 32. I am using torch 2.8.0 with CUDA 12.4. Does anyone know why this setup works sometimes and crashes at other times? Should I upgrade to CUDA 12.8, since torch 2.8.0 is built on top of it?
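
For reference, a minimal check along these lines (just a sketch, independent of the actual training script) should reproduce the warning right away if CUDA cannot initialize, and otherwise print the versions the wheel was built against:

```python
import torch

# Quick sanity check of the CUDA setup, run before any training code.
# It prints the versions the wheel was built against and forces CUDA
# context creation so an initialization failure surfaces immediately.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count:", torch.cuda.device_count())
    print("device name:", torch.cuda.get_device_name(0))
    x = torch.ones(1, device="cuda")  # creates the CUDA context
    print("test tensor on:", x.device)
```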

Based on this, it seems your setup has issues initializing the GPU. I would recommend checking dmesg for any Xids, which could point to a failure.

@ptrblck, thank you for the reply. This is the dmesg output I am getting:
[ 1.022428] r8169 0000:02:00.0 eth0: RTL8168h/8111h, 08:bf:b8:08:ea:58, XID 541, IRQ 69
[ 3.642416] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.163.01 Tue Apr 8 12:41:17 UTC 2025
[ 3179.521531] NVRM: GPU at PCI:0000:01:00: GPU-ee28abd9-c58b-6235-e471-68ddd692cb21
[ 3179.521547] NVRM: Xid (PCI:0000:01:00): 31, pid=5131, name=python3, Ch 00000008, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_ESC faulted @ 0x2_00200000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[ 3470.577146] NVRM: Xid (PCI:0000:01:00): 31, pid=1513, name=modprobe, Ch 00000004, intr 00000000. MMU Fault: ENGINE HOST2 HUBCLIENT_ESC faulted @ 0x1_01010000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

Xid 31 is a memory page fault, caused e.g. by an illegal memory access.
Are you sure these Xids are close to when you are observing the issues in your script?

Use dmesg -T to get human-readable time stamps.
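
Since the workflow is in Python anyway, a small helper like this (only a sketch; it assumes the kernel log is readable by the current user, otherwise run it as root) pulls out just the Xid lines with readable timestamps so they can be compared against the time of the crash:

```python
import subprocess

# Sketch: print only the Xid entries from the kernel log with human-readable
# timestamps, so they can be matched against when the training script failed.
log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "Xid" in line:
        print(line)
```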

The issue was occurring when the batch size was increased from 8 to 16 or 32. I use torch 2.8.0, which requires CUDA 12.8; I had CUDA 12.4, and after updating to 12.8 the issue was still not solved. Once the batch size was set back to 8, the issue went away. My training runs on a Raspberry Pi, since I am training the model using federated learning. I think there is some issue in my script that is causing this memory overflow on the GPU.
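
If the larger batch is only needed for its effect on the gradients, one common workaround is gradient accumulation: keep the per-step batch at 8 (which fits in memory) and accumulate gradients over several steps. Below is a minimal sketch with a placeholder linear head and random data; the real classifier head and dataloader would go in their place.

```python
import torch
import torch.nn as nn

# Gradient accumulation sketch: micro-batches of 8, gradients accumulated over
# 4 steps, which behaves roughly like one batch of 32 without the memory cost.
# The tiny linear "classifier head" and random tensors are placeholders only.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 10).to(device)            # stand-in for the real head
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accum_steps = 4                                  # 4 x 8 ~ effective batch of 32
optimizer.zero_grad()
for step in range(100):                          # stand-in for the dataloader loop
    x = torch.randn(8, 512, device=device)       # micro-batch of 8
    y = torch.randint(0, 10, (8,), device=device)
    loss = criterion(model(x), y) / accum_steps  # scale so accumulated grads match
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```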