Unable to write to file </torch_18692_1954506624>

Ubuntu: 16.04 (server)
Python: 3.6
PyTorch: 0.2.0_3

Error: RuntimeError: unable to write to file </torch_18693_1954506624> at /pytorch/torch/lib/TH/THAllocator.c:271

I encountered this error when running PyTorch code on an Ubuntu server.

While debugging the code, I found that the error occurs in the DataLoader.

The dataset's __getitem__ method returns (img, label), where img is an ndarray. I also tried returning img as a Tensor, but in that case the process just blocks.
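Roughly, the relevant part looks like this (a simplified sketch, not the actual code; the class and variable names are made up):

import numpy as np
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, samples):
        self.samples = samples  # list of (image, label) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        img, label = self.samples[index]
        return np.asarray(img), label  # img is returned as an ndarray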

The code runs properly on my local machine but fails on the server.

What should I do to fix this?

Thanks!


Are you using Docker?
I had a similar issue and had to add the --ipc=host flag.

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multi-worker data loaders), the default shared memory segment size that the container runs with is not enough, and you should increase the shared memory size either with the --ipc=host or --shm-size command line options to nvidia-docker run.
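For example, something like: nvidia-docker run --ipc=host <image> <command>, or with an explicit size: nvidia-docker run --shm-size=8g <image> <command> (the image, command, and the 8g value are only placeholders; pick a size that fits your workload).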


Solved my problem! 👍

Hi, I created my environment with conda (PyTorch 1.2.0, CUDA 10.0). After training for 2 epochs, this problem happens. How can I solve it?

You might not have enough shared memory, so you could try to increase it on your system (or in Docker, if you are using it).
I would also recommend updating to the latest stable PyTorch version (1.5) in case you are hitting an older bug.

If you are using multiple workers in your DataLoader, you could also try setting num_workers=0 for the sake of debugging.
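A minimal sketch of that debugging step (the dataset and batch size are placeholders):

from torch.utils.data import DataLoader

# num_workers=0 loads the data in the main process, so no shared memory
# is needed between worker processes; useful to rule out /dev/shm issues.
loader = DataLoader(my_dataset, batch_size=32, shuffle=True, num_workers=0)

for img, label in loader:
    ...  # run the usual training / inspection step here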

Thanks! I killed the other processes and ran only this PyTorch task, and the problem disappeared. The reason was that my system did not have enough shared memory. Thanks for your reply!

Where should I add the --ipc=host flag: in the notebook, or on the command line?

--ipc=host should be passed as an argument to the docker run command.
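For example (the image and command are placeholders): docker run --ipc=host <image> <command>. The flag has to be set when the container is started; it cannot be changed from inside a running notebook.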


Is there a way to override the location of /dev/shm (shared memory) for PyTorch?

Reference for sklearn: https://stackoverflow.com/questions/40115043/no-space-left-on-device-error-while-fitting-sklearn-model
Example: %env JOBLIB_TEMP_FOLDER=/tmp

Please suggest some alternatives.

I'm not aware of a way to do so and would recommend increasing the shared memory if your setup doesn't provide a sufficiently large amount.

Unfortunately, increasing the shared memory is not possible for me. Please suggest alternatives.

I don't know of alternatives to shared memory for multiprocessing IPC.
The fallback would be to do the data loading in the main process via num_workers=0, but this would also reduce performance.

Yes, num_workers=0 works, but then training the model takes a lot of time.
Thanks!

Do you know if there is a way to do this in Airflow? I am running into the same error, but I am not sure how to provide this argument when using a Kubernetes pod operator (airflow.contrib.operators.kubernetes_pod_operator in the Airflow documentation).

Unfortunately not, as I'm not familiar with Airflow.

Thanks for replying. In case anyone else runs into this issue with Airflow: I was able to solve it by following this suggestion.

In a Kubernetes pod operator the arguments looked like this:

from kubernetes.client import models as k8s_models

# Passed as arguments to the pod operator:
volumes=[
    # In-memory emptyDir volume that will back /dev/shm inside the pod
    k8s_models.V1Volume(
        name="dshm",
        empty_dir=k8s_models.V1EmptyDirVolumeSource(medium="Memory"),
    ),
],
volume_mounts=[
    # Mount it at /dev/shm so the DataLoader workers get more shared memory
    k8s_models.V1VolumeMount(
        name="dshm",
        mount_path="/dev/shm",
    ),
],
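These volumes and volume_mounts lists are passed as keyword arguments to the pod operator. The effect is to mount a memory-backed emptyDir volume at /dev/shm inside the pod, which gives the DataLoader workers more shared memory; it is the Kubernetes equivalent of the --ipc=host / --shm-size Docker options mentioned earlier in this thread.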

Hello, do you have any idea what to do if I come across this problem with Kubernetes?

Maybe, since Kubernetes orchestrates containers and the error is raised by a lack of shared memory.