Why can't I effectively parallelize my reinforcement learning programs using process-based parallelism?

My objective is to run multiple reinforcement learning programs, using the Stable-Baselines3 library, at the same time. What I notice is that as I increase the number of programs, the iteration speed gradually decreases, which is surprising since each program should be running in its own process (and therefore, I assumed, on its own core).

Here is my program:

from joblib import Parallel, delayed

import gym
# from sbx import SAC  # JAX-based alternative to stable_baselines3
from stable_baselines3 import SAC


def train():
    # Each task trains an independent SAC agent on its own environment.
    env = gym.make("Humanoid-v4")
    model = SAC("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=int(7e5), progress_bar=True)


if __name__ == '__main__':
    num_of_programs = 1
    # n_jobs caps the worker pool; only num_of_programs tasks are submitted.
    Parallel(n_jobs=10)(delayed(train)() for _ in range(num_of_programs))

num_of_programs controls how many training programs I try to run in parallel. Here are some statistics:

Number of programs    Iteration speed
1                     ~102 it/s
3                     ~60 it/s
10                    ~20 it/s
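
One thing I suspect is CPU oversubscription: as far as I understand, each PyTorch process defaults to an intra-op thread pool sized to all visible cores, so several "independent" processes could still contend for the same 16 CPUs. Below is a minimal variant of train() I could use to test that (torch.set_num_threads is PyTorch's own API; OMP_NUM_THREADS and MKL_NUM_THREADS are standard OpenMP/MKL environment variables, not anything specific to Stable-Baselines3):

import os

# Cap intra-process threading before the heavyweight imports; these
# environment variables are typically read at library import time.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import gym
import torch
from stable_baselines3 import SAC


def train():
    torch.set_num_threads(1)  # cap PyTorch's intra-op thread pool in this worker
    env = gym.make("Humanoid-v4")
    model = SAC("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=int(7e5), progress_bar=True)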

I made sure to request enough resources so that there isn't a resource constraint. This is how I request resources with Slurm:

srun --time=10:00:00 --nodes=1 --cpus-per-task=16 --mem=32G --partition=gpu --gres=gpu:a100-pcie:1 --pty /usr/bin/bash

Therefore I have 16 CPUs, 32 GB of memory, and a 40 GB A100 GPU.
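
For completeness, this is the kind of sanity check I can run inside the allocation to confirm what a single process actually sees (os.sched_getaffinity is Linux-only):

import os
import torch

# How many CPUs can this process be scheduled on, and how many
# threads will PyTorch use by default?
print("CPUs visible:", len(os.sched_getaffinity(0)))
print("torch default threads:", torch.get_num_threads())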

I noticed the same issue when I moved from stable_baselines3 to sbx: while stable_baselines3 uses torch as its deep learning library, sbx uses JAX.
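
Switching between the two is just the import swap hinted at by the commented line in my script; everything else stays the same:

from sbx import SAC                  # JAX-based implementation
# from stable_baselines3 import SAC  # PyTorch-based implementation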