I am trying to run a PyTorch script via Slurm. I have a simple PyTorch script that creates random numbers and stores them in a txt file. However, I get the following error from Slurm:
from torch._C import * # noqa: F403
ImportError: libtorch_cpu.so: failed to map segment from shared object: Cannot allocate memory
srun: error: node138: task 0: Exited with exit code 1
Here is the simple PyTorch code:

import torch

with open('sample.txt', 'w') as f:
    for i in range(100):
        m = torch.rand(1, 100)
        print(m)
        f.write(str(m))
        f.write('\n')
# no explicit f.close() needed: the `with` block closes the file
PyTorch works fine on my workstation without Slurm, but for my current use case I need to run training on the cluster, hence the need for Slurm.
When I used NumPy instead, the job ran under Slurm with no error, so the issue is likely specific to PyTorch when running via Slurm. Any ideas or workarounds would be appreciated.
Hello, is that txt file accessible from all the nodes in the Slurm cluster? You could also try setting os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1" to see if we can get a better stack trace, and post it as a GitHub issue.
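For reference, that flag is only picked up if it is in the environment before torch is first imported, so it has to go at the very top of the script. A minimal sketch (the variable name is from the suggestion above; the placement is the point):

```python
import os

# TORCH_SHOW_CPP_STACKTRACES is read when torch first loads, so set it
# before any `import torch` line in the script.
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"

# import torch  # <- only import torch after the variable is set
```

Equivalently, you can `export TORCH_SHOW_CPP_STACKTRACES=1` in the Slurm job script before the `srun` line, which avoids the ordering concern entirely.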
I figured out what the issue was. For some reason on our cluster, if you activate a Python environment before allocating a GPU, you get logged out of that environment and returned to the base environment, which unloads all loaded modules, including Python.
So what I did was load all the needed modules (in this case nvidia-cuda and python) after the GPU allocation, and move the Python environment activation to after the allocation as well, just before running the Python script. As seen below:
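A sketch of the resulting job script, following the order described above. The module names (nvidia-cuda, python) come from the post; the job name, memory request, environment path, and script name are placeholders you would replace with your own:

```shell
#!/bin/bash
#SBATCH --job-name=rand-sample        # placeholder job name
#SBATCH --gres=gpu:1                  # GPU allocation happens here, first
#SBATCH --mem=8G                      # placeholder memory request

# Load modules AFTER the allocation is granted, since the allocation
# resets the shell back to the base environment.
module load nvidia-cuda
module load python

# Activate the Python environment just before running the script.
source ~/envs/torch-env/bin/activate  # placeholder environment path

srun python random_sample.py          # placeholder script name
```

The key point is ordering: anything the allocation step would wipe out (modules, virtualenv activation) goes after it, immediately before the `srun` line.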