Which is faster: loading data in parallel, or running multiple training processes in parallel with data loaded sequentially?

I currently run an active learning algorithm. Say the dataset has 30,000 samples. I start with 300 and train a model on these 300 images (roughly 7 epochs with early stopping). Then I add 700 more samples (300 + 700 = 1,000) and retrain the model. I repeat this for 15 active learning iterations in total, so the final iteration trains on 300 + 700*(15-1) = 10,100 samples (roughly 1/3 of the dataset). This is what I call an experiment.
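To make the schedule above concrete, here is a minimal sketch of how the labeled-set size grows across iterations (the constants are the numbers from my description, not code from my actual project):

```python
# Hypothetical sketch of the active-learning sample schedule:
# 300 seed samples, then +700 newly queried samples per iteration.
INITIAL = 300
PER_ITER = 700
N_ITERS = 15

labeled = INITIAL
schedule = [labeled]           # labeled-set size at each AL iteration
for _ in range(N_ITERS - 1):   # the first iteration uses only the seed set
    labeled += PER_ITER        # query and label 700 more samples
    schedule.append(labeled)

print(schedule[0], schedule[-1])  # 300 10100
```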

Now, I need to run the above experiment 10 times to obtain a mean and standard deviation of the results, and I am running into long run times. My original setup used a DataLoader with num_workers > 0 and pin_memory = True, waited until an entire experiment finished, then repeated the process for all 10 runs. Recently, I tried using multiprocessing to run, say, 3 experiments at a time, but then the DataLoader cannot use num_workers > 0 (the pool's daemonic workers are not allowed to spawn child processes). I also had to reduce the batch size to around 16 or 8 to avoid running out of GPU memory.

Given all that, does anyone know the optimal way to go about this? Currently I have a single RTX 2080, but soon I will have two TITAN RTX GPUs (and maybe better ones after that, say L-series cards or RTX 4090s). My priority right now is speed.