/home/jarvis/yes/envs/sol-fl/lib/python3.11/site-packages/torch/cuda/__init__.py:182: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
I get this warning when I train my classifier head for more epochs and after I increased the batch size from 8 to 16 or 32. I am using torch 2.8.0 with CUDA 12.4. Does anyone know why this setup works sometimes and crashes at other times? Should I upgrade to CUDA 12.8, since torch 2.8.0 is built on top of it?
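For anyone hitting the same warning, it may help to first check which CUDA version the installed PyTorch binary was built against, since prebuilt binaries usually ship their own CUDA runtime and that can differ from the system toolkit. Below is a minimal sketch, not from the original posts; `cuda_report` is a hypothetical helper name, and it only uses attributes that PyTorch exposes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.device_count()`).

```python
import os


def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup in a dict."""
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment
        report["torch_version"] = None
        return report
    report["torch_version"] = torch.__version__
    # CUDA version the installed binary was built against (None for CPU-only builds)
    report["built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False in a fresh process right after a crash, the problem is probably in the driver or device state rather than in the batch size itself.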
Xid 31 is a memory page fault caused by e.g. an illegal memory access.
Are you sure these Xids are close to when you are observing the issues in your script?
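One way to check the timing is to pull the Xid entries out of the kernel log and compare their timestamps with the time of the crash. The sketch below is only an illustration under assumptions: `find_xids` is a hypothetical helper, the regular expression assumes the usual `NVRM: Xid (<device>): <number>` layout of the driver message, and the sample log lines are made up rather than taken from this thread.

```python
import re

# Assumed layout of the NVIDIA driver message: "NVRM: Xid (PCI:0000:01:00): 31, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<xid>\d+)")


def find_xids(log_text):
    """Return (line, device, xid) tuples for every Xid entry in a kernel log dump."""
    hits = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            hits.append((line.strip(), match.group("device"), int(match.group("xid"))))
    return hits


# Illustrative sample only, not real output from the poster's machine
sample = """\
[Mon Jan  1 10:15:02 2024] usb 1-1: new high-speed USB device number 4
[Mon Jan  1 10:15:40 2024] NVRM: Xid (PCI:0000:01:00): 31, pid=4242, name=python, Ch 00000008
"""

for line, device, xid in find_xids(sample):
    print(f"Xid {xid} on {device}: {line}")
```

In practice the text would come from something like the output of `sudo dmesg -T`, which prints human-readable timestamps that can be matched against the training log.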
The issue occurred when the batch size was increased from 8 to 16 or 32. I use torch 2.8.0, which requires CUDA 12.8; I had CUDA 12.4, and updating to 12.8 did not solve the issue. After I set the batch size back to 8, the issue went away. My training runs on a Raspberry Pi, since I am training the model with federated learning. I think something in my script is causing this memory overflow on the GPU.
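If the larger batch is still wanted while only 8 samples fit on the device at once, gradient accumulation is one possible workaround. This is a minimal sketch and not the poster's script: `train_with_accumulation`, the `nn.Linear` stand-in for the classifier head, and the random data are all assumptions made for illustration. It does not fix an underlying illegal memory access; it only keeps the per-step memory footprint at the level that was reported to work.

```python
import torch
from torch import nn


def train_with_accumulation(model, batches, optimizer, loss_fn, accum_steps, device):
    """Run one pass over `batches`, stepping the optimizer every `accum_steps` micro-batches."""
    model.train()
    optimizer.zero_grad()
    steps_taken = 0
    for i, (inputs, targets) in enumerate(batches, start=1):
        inputs, targets = inputs.to(device), targets.to(device)
        # Scale the loss so the accumulated gradient matches one large-batch average
        loss = loss_fn(model(inputs), targets) / accum_steps
        loss.backward()
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            steps_taken += 1
    # Note: leftover micro-batches (when the count is not a multiple of
    # accum_steps) are not applied in this simple sketch.
    return steps_taken


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
head = nn.Linear(32, 4).to(device)  # hypothetical stand-in for the classifier head
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)
# Four random micro-batches of 8 samples; accum_steps=2 gives an effective batch of 16
batches = [(torch.randn(8, 32), torch.randint(0, 4, (8,))) for _ in range(4)]
steps = train_with_accumulation(head, batches, optimizer, nn.CrossEntropyLoss(), 2, device)
print(f"optimizer steps: {steps}")
```

With `accum_steps=4` the same loop would emulate a batch size of 32 while each forward and backward pass still sees only 8 samples.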