Concurrent multi-GPU batch inference

Hello!
I would like to run GPU inference over batches of audio data while concurrently saving the outputs to disk with the CPU. I have read many threads here and on other pages, and so far this is what I came up with:

Dataset:

import random

def create_dataset(path, seed=123):
    # rglob_audio_files walks the directories and returns the list of audio file
    # paths (.wav, .mp3, .flac, etc.)
    paths = rglob_audio_files(path)
    random.Random(seed).shuffle(paths)
    ds = Dataset(paths, training=False)
    return ds

Dataloader:

from torch.utils.data import DataLoader

def create_dataloader(path, batch_size, world_size):
    ds = create_dataset(path=path)

    dl = DataLoader(
        ds,
        batch_size=batch_size,
        shuffle=True,               # the path list is already shuffled in create_dataset
        num_workers=world_size,     # these are CPU worker processes, not GPUs
        collate_fn=ds.collate_fn
    )
    return dl
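Related to the parallelization question: I was not sure whether I should instead give each rank its own shard of the files with a DistributedSampler, rather than relying on num_workers. A rough sketch of what I mean, reusing the same Dataset (the function name and the num_workers value here are just placeholders):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def create_dataloader_sharded(path, batch_size, rank, world_size):
    ds = create_dataset(path=path)
    # each rank sees a disjoint 1/world_size slice of the file list
    sampler = DistributedSampler(ds, num_replicas=world_size, rank=rank, shuffle=False)
    dl = DataLoader(
        ds,
        batch_size=batch_size,
        sampler=sampler,          # sampler and shuffle are mutually exclusive
        num_workers=4,            # CPU workers just for loading/decoding audio
        collate_fn=ds.collate_fn,
    )
    return dl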

Parallel Inference Function:

import concurrent.futures

import torch
import torchaudio
import torch.distributed as dist

def inference(model, path, out_path, batch_size, device, world_size):
    dist.init_process_group("nccl", rank=device, world_size=world_size)
    torch.cuda.set_device(device)
    model = model.to(device).eval()

    # thread pool so the CPU can write files while the GPU runs the next batch
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=world_size)
    loader = create_dataloader(path, batch_size, world_size)

    def save(out_file, data):
        for d in data:
            torchaudio.save(out_file, d[None], sampling_rate)   # sampling_rate is defined elsewhere in my script

    chunk = max(1, batch_size // world_size)   # split each output batch among the saver threads
    for data in loader:
        with torch.no_grad():
            output_data = model(data.to(device))[0].cpu()
        out_file = out_path / path.relative_to(path)   # I don't know how to get the original path here
        for i in range(0, output_data.shape[0], chunk):
            pool.submit(save, out_file, output_data[i:i + chunk])
    pool.shutdown(wait=True)   # wait for the pending saves to finish

Main:

import torch.multiprocessing as mp

def main(model, in_path, out_path, batch_size, world_size):
    # mp.spawn calls _inference(rank, *args), so each process gets its rank as the first argument
    def _inference(rank, world_size):
        return inference(model, in_path, out_path, batch_size, rank, world_size)

    mp.spawn(_inference,
             args=(world_size, ),
             nprocs=world_size)
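In case it matters, my understanding is that init_process_group("nccl", rank=..., world_size=...) still needs a rendezvous address, so I set MASTER_ADDR / MASTER_PORT before spawning. Something like this, where model and args are built elsewhere and the values are just what I use on a single machine:

import os

if __name__ == "__main__":
    os.environ.setdefault("MASTER_ADDR", "localhost")   # rendezvous address for init_process_group
    os.environ.setdefault("MASTER_PORT", "29500")       # any free port
    main(model, args.in_path, args.out_path, args.batch_size, args.world_size)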

Can somebody tell me whether this is the correct way to parallelize across GPUs and run the inference concurrently while the CPU saves the output files? Or does the DataLoader already take care of this?
I would also appreciate advice on how to get the original file path inside my inference function. The only things that occur to me are another "for" loop iterating over a tqdm of the paths (which I guess would also need to run in parallel), or making the Dataset return the path together with the audio, as in the sketch below. If anything about my questions is unclear, please ask.
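Concretely, here is the sketch of the second idea: the Dataset returns the path alongside the audio, so each batch carries its own source paths and the output path can be rebuilt from them. This is only a rough sketch (my real Dataset and collate_fn are different, and in_path stands for the root folder that was passed to create_dataset):

import torch
import torchaudio
from pathlib import Path

class AudioPathDataset(torch.utils.data.Dataset):    # simplified stand-in for my Dataset
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        wav, sr = torchaudio.load(self.paths[idx])
        return wav, sr, self.paths[idx]               # also return the source path

    @staticmethod
    def collate_fn(batch):
        wavs, srs, paths = zip(*batch)
        return list(wavs), list(srs), list(paths)     # keep the paths next to the audio

# and then in the inference loop, something like:
# for wavs, srs, paths in loader:
#     outputs = ...                                   # run the model on the batch
#     for out, p, sr in zip(outputs, paths, srs):
#         out_file = out_path / Path(p).relative_to(in_path)   # mirror the input folder structure
#         out_file.parent.mkdir(parents=True, exist_ok=True)
#         pool.submit(torchaudio.save, str(out_file), out.cpu()[None], sr)

Would something like this be a reasonable way to recover the paths, or is there a more standard pattern for it?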