Why is it a bad idea to use Python's `concurrent.futures` with PyTorch, and how can I parallelize batch loading in RL?

I am trying to asynchronously load a batch from a replay buffer with PyTorch while optimizing the model parameters, thereby hiding the batch loading latency. The program I run is as follows:

import time

batch_load = 0.0
optimize_time = 0.0
for _ in range(100):
    begin = time.time()
    batch = sample_batch()  # draw a batch from the replay buffer
    batch_load += time.time() - begin
    begin = time.time()
    optimize(batch)         # one optimization step on this batch
    optimize_time += time.time() - begin

When running this script, `batch_load` takes about 0.001 seconds and `optimize_time` about 0.009 seconds. To hide the latency of the batch loading (although it doesn't take long in this program, it takes more time in another program which I would actually like to optimize), I thought I could use Python's `concurrent.futures` module to acquire a future from `sample_batch` and load it while `optimize` is running. That program looks as follows:

import concurrent.futures

batch_load = 0.0
optimize_time = 0.0
with concurrent.futures.ProcessPoolExecutor(max_workers=12) as executor:
    batch = executor.submit(sample_batch).result()  # prefetch the first batch
    for _ in range(100):
        begin = time.time()
        future = executor.submit(sample_batch)      # request the next batch
        batch_load += time.time() - begin
        begin = time.time()
        optimize(batch)                             # train on the current batch
        optimize_time += time.time() - begin
        batch = future.result()                     # swap in the prefetched batch

This turned out to be a pretty bad idea: the data loading time increased to 0.085 seconds and the optimization time to 0.13 seconds.

Can somebody kindly explain why the second program is so much slower than the first? And does anybody have ideas on how to hide the data loading latency? I appreciate any answers and suggestions very much!

Since `batch_load` measures the latency of `executor.submit`, I assume that's the overhead of the `ProcessPoolExecutor`?
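
One way to sanity-check that assumption is to time `executor.submit` with a trivial payload. This is just a quick probe I would run separately, not part of the training code; `noop` is a hypothetical do-nothing function:

import concurrent.futures
import time

def noop():
    return None

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=12) as executor:
        executor.submit(noop).result()  # warm up the pool before measuring
        begin = time.time()
        futures = [executor.submit(noop) for _ in range(100)]
        submit_overhead = (time.time() - begin) / 100
        concurrent.futures.wait(futures)
        print(f"average submit latency: {submit_overhead:.6f} s")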

But it is still weird that `optimize()` also slowed down a lot. Does `optimize()` run ops on the GPU? If so, you will need to either call `torch.cuda.synchronize()` on that GPU or use CUDA events' `elapsed_time` to measure the latency, because CUDA ops return as soon as they are enqueued on the stream, not when they are done.
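
For example, a minimal sketch of both timing approaches, assuming `optimize(batch)` launches GPU work (names taken from the post above):

import time
import torch

# Option 1: synchronize so all queued kernels finish before reading the wall clock.
torch.cuda.synchronize()
begin = time.time()
optimize(batch)
torch.cuda.synchronize()
optimize_time = time.time() - begin

# Option 2: CUDA events record timestamps on the stream itself.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
optimize(batch)
end.record()
torch.cuda.synchronize()                # make sure both events have completed
optimize_time = start.elapsed_time(end) / 1000  # elapsed_time is in milliseconds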

Thank you @mrshenli for your answer!

Indeed, the slower run time was caused entirely by the overhead of the `ProcessPoolExecutor`. Interestingly, the executor also affected the latency of calls that don't go through it, such as `optimize`. I measured the entire program again with longer-running tasks: the overhead of the `ProcessPoolExecutor` stayed roughly constant, and the data loading latency could be hidden under the `optimize` call.
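
For anyone who finds this later, here is the overlap pattern distilled. I sketch it with a `ThreadPoolExecutor`, which is an untested variant on my part: it avoids pickling each batch between processes, which may matter if `sample_batch` mostly does NumPy/tensor indexing that releases the GIL:

import concurrent.futures

# Prefetch pattern: request the next batch before optimizing on the current
# one, so sample_batch runs in the background during optimize.
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(sample_batch)      # prefetch the first batch
    for _ in range(100):
        batch = future.result()                 # wait for the prefetched batch
        future = executor.submit(sample_batch)  # start loading the next one
        optimize(batch)                         # overlaps with batch loading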

Again, thank you for your reply - It helped me a lot!
