I am trying to run a PyTorch script via Slurm. I have a simple PyTorch script that creates random numbers and stores them in a txt file. However, I get the following error from Slurm:
from torch._C import * # noqa: F403
ImportError: libtorch_cpu.so: failed to map segment from shared object: Cannot allocate memory
srun: error: node138: task 0: Exited with exit code 1
Here is the simple PyTorch code:

import torch

with open('sample.txt', 'w') as f:
    for i in range(100):
        m = torch.rand(1, 100)
        print(m)
        f.write(str(m))
        f.write('\n')
# no explicit f.close() needed: the `with` block closes the file
PyTorch works fine on my workstation without Slurm, but for my current use case I need to run training on the cluster, hence the need for Slurm.
When I used NumPy instead, the job ran under Slurm with no error, so the issue is likely specific to PyTorch when running via Slurm. Any ideas or workarounds would be appreciated.
Hello, is that txt file accessible from all the nodes in the Slurm cluster? You could also try setting os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1" to see if we can get a better stack trace, and post it as a GitHub issue.
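For reference, that flag is only picked up if it is in the environment before torch is first imported, so it has to go at the very top of the script. A minimal sketch (the variable name is from the suggestion above; the placement is the point):

```python
import os

# TORCH_SHOW_CPP_STACKTRACES is read when torch first loads, so set it
# before any `import torch` line in the script.
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"

# import torch  # <- only import torch after the variable is set
```

Equivalently, you can `export TORCH_SHOW_CPP_STACKTRACES=1` in the Slurm job script before the `srun` line, which avoids the ordering concern entirely.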
I figured out what the issue was. For some reason on our cluster, if you activate a Python environment before allocating a GPU, you get logged out of that environment and returned to the base environment, which unloads all loaded modules, including Python.
So what I did was load all the needed modules (in this case nvidia-cuda and python) after the GPU allocation, and move the Python environment activation to after the allocation as well, just before running the Python script. As seen below:
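A sketch of the resulting job script, following the order described above. The module names (nvidia-cuda, python) come from the post; the job name, memory request, environment path, and script name are placeholders you would replace with your own:

```shell
#!/bin/bash
#SBATCH --job-name=rand-sample        # placeholder job name
#SBATCH --gres=gpu:1                  # GPU allocation happens here, first
#SBATCH --mem=8G                      # placeholder memory request

# Load modules AFTER the allocation is granted, since the allocation
# resets the shell back to the base environment.
module load nvidia-cuda
module load python

# Activate the Python environment just before running the script.
source ~/envs/torch-env/bin/activate  # placeholder environment path

srun python random_sample.py          # placeholder script name
```

The key point is ordering: anything the allocation step would wipe out (modules, virtualenv activation) goes after it, immediately before the `srun` line.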