What is the correct way to source an environment (such as conda or venv) for PyTorch multi-node training on Slurm? Should I source the environment on the main node only, or do I have to source it on all of the nodes I am training with?
As long as you have a shared filesystem between the main node and the runner nodes, I think you can just rely on the env propagating from the main node when you launch: activate it once in your sbatch script before the launch step, and srun will export that environment to every task by default. Something like the sketch below.
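A minimal sketch, assuming a conda env called my_torch_env that lives on the shared filesystem and a train.py entry point (the env name, paths, and GPU counts are placeholders for your own setup):

```bash
#!/bin/bash
#SBATCH --job-name=multinode-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4

# Activate the env once, on the node running this batch script.
# srun propagates this environment (--export=ALL is the default),
# so every node resolves python/torchrun through the shared path.
source /shared/miniconda3/etc/profile.d/conda.sh
conda activate my_torch_env

# Pick the first allocated node as the rendezvous host.
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "${nodes[0]}" hostname --ip-address)

srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${head_node_ip}:29500" \
    train.py
```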
If you have another setup where the runner nodes keep the env on their local disks, then you'd need to take care of setting up the env correctly on the runners before execution starts, e.g. by activating it inside the command that srun launches on each node. (You'd have to override the env that came from the main node if it pointed at the path of the conda env on the main node; re-activating locally handles that by putting the local env first on PATH.) A sketch of that variant is below.
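A rough sketch of that variant, again with placeholder paths (/opt/miniconda3 standing in for wherever each runner keeps its local copy) and placeholder env/script names:

```bash
#!/bin/bash
#SBATCH --job-name=multinode-train
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4

# Compute the rendezvous address once; srun still propagates plain
# variables like this one to every task.
nodes=( $(scontrol show hostnames "$SLURM_JOB_NODELIST") )
export HEAD_NODE_IP=$(srun --nodes=1 --ntasks=1 -w "${nodes[0]}" hostname --ip-address)

# Each task activates its *local* copy of the env before torchrun runs,
# which prepends the local env to PATH and overrides whatever conda paths
# were inherited from the main node.
srun bash -c '
    source /opt/miniconda3/etc/profile.d/conda.sh   # local disk on each runner
    conda activate my_torch_env
    torchrun --nnodes="$SLURM_NNODES" --nproc_per_node=4 \
             --rdzv_backend=c10d --rdzv_endpoint="${HEAD_NODE_IP}:29500" \
             train.py
'
```

The single quotes around the srun command matter: they stop the outer batch script from expanding the variables, so each node's bash expands them from its own (propagated) environment after the local activation has run.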