Ubuntu : 16.04 server
Error : RuntimeError: unable to write to file </torch_18693_1954506624> at /pytorch/torch/lib/TH/THAllocator.c:271
I encountered this error when running PyTorch code on an Ubuntu server.
While debugging the code, I found that the error occurs in the DataLoader.
The __getitem__ method returns (img, label), where img is an ndarray. I also tried returning img as a Tensor, but in that case the process hangs.
The code runs properly on my local machine but fails on the server.
What should I do to fix this?
Are you using Docker?
I had a similar issue and had to add the
Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.
Hi, I created a conda env with PyTorch 1.2.0 and CUDA 10.0. After training for 2 epochs, this problem occurs. How can I solve it?
You might not have enough shared memory, so you could try to increase it on your system (or docker, if you are using it).
I would also recommend updating to the latest stable PyTorch version (1.5), just in case you are hitting an older bug.
If you are using multiple workers in your DataLoader, you could also try setting num_workers=0 for the sake of debugging.
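For reference, a minimal sketch of the num_workers=0 debugging setup. ToyDataset and its tensor shapes are made up for illustration and are not the original poster's dataset. With num_workers=0 the samples are loaded in the main process, so no worker processes are started and an exception raised in __getitem__ shows up with a direct traceback.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the real dataset; returns (img, label)."""

    def __init__(self, n=8):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        img = torch.zeros(3, 4, 4)  # fake 3-channel 4x4 image
        label = idx % 2
        return img, label


# num_workers=0: batches are built in the main process, no worker processes.
loader = DataLoader(ToyDataset(), batch_size=4, num_workers=0)

for imgs, labels in loader:
    print(tuple(imgs.shape), labels.tolist())
```

If the error disappears with this setting and comes back with num_workers > 0, that points toward the worker/shared memory path rather than the dataset code itself.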
Thanks~ I killed the other processes and ran only this PyTorch task, and the problem disappeared. The reason is that my system does not have enough shared memory. Thanks for your reply~
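To check whether shared memory is the bottleneck, one option is to look at how full the shared memory mount is. The sketch below assumes a Linux system where shared memory is mounted at /dev/shm; the helper name shm_usage_mib is hypothetical and only uses the standard library.

```python
import os
import shutil


def shm_usage_mib(path="/dev/shm"):
    """Return total/used/free space of the given mount in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total": usage.total / mib,
        "used": usage.used / mib,
        "free": usage.free / mib,
    }


# /dev/shm is an assumption; it exists on typical Linux systems only.
if os.path.isdir("/dev/shm"):
    print(shm_usage_mib())
```

Running this while the other processes are active and again after stopping them should show how much shared memory they were holding.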
Where should I add the --ipc=host flag: in the notebook or on the command line?
--ipc=host should be passed as an argument to the docker run command.
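As a rough illustration of where the flag goes, the sketch below assembles a docker run command line in Python and prints it. The image name my-pytorch-image, the helper docker_run_command, and the --rm/-it options are assumptions for the example; only --ipc=host and --shm-size come from the note quoted above.

```python
import shlex


def docker_run_command(image, shm_size=None):
    """Build a `docker run` argument list with a shared memory option."""
    cmd = ["docker", "run", "--rm", "-it"]
    if shm_size is None:
        # Share the host's IPC namespace, including its shared memory.
        cmd.append("--ipc=host")
    else:
        # Or give the container its own, larger shared memory segment.
        cmd.append(f"--shm-size={shm_size}")
    cmd.append(image)  # hypothetical image name supplied by the caller
    return cmd


print(shlex.join(docker_run_command("my-pytorch-image")))
print(shlex.join(docker_run_command("my-pytorch-image", shm_size="8g")))
```

Either way the option is placed before the image name, since anything after the image name is treated as the command to run inside the container.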
Is there a way to override the location of /dev/shm (shared memory) for PyTorch?
Reference for sklearn: https://stackoverflow.com/questions/40115043/no-space-left-on-device-error-while-fitting-sklearn-model
Example : %env JOBLIB_TEMP_FOLDER=/tmp
Please suggest some alternatives
I’m not aware of a way to do so and would recommend to increase the shared memory, if your setup doesn’t provide a sufficiently large amount.
Unfortunately for me increasing the shared memory is not possible. Please suggest alternatives.
I don’t know of any alternatives to shared memory for multiprocessing IPC.
The fallback would be to use the main process for data loading via num_workers=0, but this would also reduce performance.
Yes, num_workers=0 works, but training the model takes a lot of time.