Unable to write to file </torch_18692_1954506624>

Ubuntu: 16.04 (server)
Python: 3.6
PyTorch: 0.2.0_3

Error: RuntimeError: unable to write to file </torch_18693_1954506624> at /pytorch/torch/lib/TH/THAllocator.c:271

I encountered this error when running PyTorch code on an Ubuntu server.

While debugging the code, I found that the error occurs in the DataLoader.

The dataset's __getitem__ method returns (img, label), where img is an ndarray. I also tried returning img as a Tensor, but in that case the process just blocks.
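Roughly, the relevant part looks like this (a simplified sketch, not the actual code; the class and variable names are made up):

import numpy as np
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, samples):
        self.samples = samples  # list of (image, label) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        img, label = self.samples[index]
        return np.asarray(img), label  # img is returned as an ndarray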

The code runs properly on my local machine but fails on the server.

What should I do to fix this?

Thanks!


Are you using Docker?
I had a similar issue and had to add the --ipc=host flag.

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multi-worker data loaders), the default shared memory segment size that the container runs with is not enough, and you should increase the shared memory size either with the --ipc=host or --shm-size command line options to nvidia-docker run.
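For example, something like: nvidia-docker run --ipc=host <image> <command>, or with an explicit size: nvidia-docker run --shm-size=8g <image> <command> (the image, command, and the 8g value are only placeholders; pick a size that fits your workload).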


Solved my problem! 👍

Hi, I created my environment with conda (PyTorch 1.2.0, CUDA 10.0). After training for 2 epochs, this problem happens. How can I solve it?

You might not have enough shared memory, so you could try to increase it on your system (or in Docker, if you are using it).
I would also recommend updating to the latest stable PyTorch version (1.5) in case you are hitting an older bug.

If you are using multiple workers in your DataLoader, you could also try setting num_workers=0 for the sake of debugging.
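A minimal sketch of that debugging step (the dataset and batch size are placeholders):

from torch.utils.data import DataLoader

# num_workers=0 loads the data in the main process, so no shared memory
# is needed between worker processes; useful to rule out /dev/shm issues.
loader = DataLoader(my_dataset, batch_size=32, shuffle=True, num_workers=0)

for img, label in loader:
    ...  # run the usual training / inspection step here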

Thanks! I killed the other processes and ran only this PyTorch task, and the problem disappeared. The reason was that my system did not have enough shared memory. Thanks for your reply!

Where should I add the --ipc=host flag: in the notebook, or on the command line?

--ipc=host should be passed as an argument to the docker run command.
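For example (the image and command are placeholders): docker run --ipc=host <image> <command>. The flag has to be set when the container is started; it cannot be changed from inside a running notebook.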


Is there a way to override the location of /dev/shm (shared memory) for PyTorch?

Reference for sklearn: https://stackoverflow.com/questions/40115043/no-space-left-on-device-error-while-fitting-sklearn-model
Example: %env JOBLIB_TEMP_FOLDER=/tmp

Please suggest some alternatives.

I'm not aware of a way to do so and would recommend increasing the shared memory if your setup doesn't provide a sufficiently large amount.

Unfortunately, increasing the shared memory is not possible for me. Please suggest alternatives.

I don't know of alternatives to shared memory for multiprocessing IPC.
The fallback would be to do the data loading in the main process via num_workers=0, but this would also reduce performance.

Yes, num_workers=0 works, but then training the model takes a lot of time.
Thanks!

Do you know if there is a way to do this in Airflow? I am running into the same error, but I am not sure how to provide this argument when using a Kubernetes pod operator (airflow.contrib.operators.kubernetes_pod_operator in the Airflow documentation).

Unfortunately not, as I'm not familiar with Airflow.

Thanks for replying. In case anyone else runs into this issue with Airflow: I was able to solve it by following this suggestion.

In a Kubernetes pod operator the arguments looked like this:

from kubernetes.client import models as k8s_models

# Passed as arguments to the pod operator:
volumes=[
    # In-memory emptyDir volume that will back /dev/shm inside the pod
    k8s_models.V1Volume(
        name="dshm",
        empty_dir=k8s_models.V1EmptyDirVolumeSource(medium="Memory"),
    ),
],
volume_mounts=[
    # Mount it at /dev/shm so the DataLoader workers get more shared memory
    k8s_models.V1VolumeMount(
        name="dshm",
        mount_path="/dev/shm",
    ),
],
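These volumes and volume_mounts lists are passed as keyword arguments to the pod operator. The effect is to mount a memory-backed emptyDir volume at /dev/shm inside the pod, which gives the DataLoader workers more shared memory; it is the Kubernetes equivalent of the --ipc=host / --shm-size Docker options mentioned earlier in this thread.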

Hello, do you have any idea what to do if I come across this problem with Kubernetes?

Maybe, since Kubernetes orchestrates containers and the error is raised by a lack of shared memory.