How multi-threading is managed within ddp?

i have a c++ loss wrapped in python.
here are some stats:
in all these cases, ddp is used, but we can choose to use one or two gpus.

here we show the forward time of the loss, more specifically the time of one part of the forward code.
that part operates on cpu. the gpu is not involved since we convert the output gpu tensor from the previous computation with cpu().numpy(). then, the computations are carried out on cpu.
time is measured using

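import torch

def forward(x):
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()  # mark the start of the timed region
    # cpu region ---
    # cpu region ---
    end_event.record()  # mark the end of the timed region
    torch.cuda.synchronize()  # both events must complete before reading the timer
    elapsed_time_ms = start_event.elapsed_time(end_event)
    print('time cpu: {}'.format(elapsed_time_ms))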
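Since the timed region runs only on the cpu, a host-side timer is an alternative to cuda events. The following is a minimal sketch, not the code used for the numbers in this post: `time_cpu_region` is a hypothetical helper and `cpu_region` is a placeholder standing in for the wrapped c++ call.

```python
import time


def time_cpu_region(fn, *args):
    """Return (result, elapsed_ms) for a call that runs entirely on the host."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def cpu_region(values):
    # Hypothetical placeholder for the wrapped C++ loss.
    return sum(v * v for v in values)


result, elapsed_ms = time_cpu_region(cpu_region, range(1000))
print('time cpu: {:.3f} ms'.format(elapsed_ms))
```

`time.perf_counter()` measures wall-clock time on the host, so it needs no gpu synchronization as long as the region itself launches no cuda work.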

1 gpu:

  • multi-threading is on: 70ms
  • multi-threading is off: 500ms.

2 gpus:

  • multi-threading is on: 500ms
  • multi-threading is off: 500ms.

the loss uses the maximum number of threads (openmp omp_get_max_threads()) which is 48 in this case.
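A quick way to see which thread count a process is likely to get is to inspect the environment from python before the c++ code runs. This is a rough sketch under assumptions: `expected_omp_threads` is a hypothetical helper that only approximates the openmp rule (use `OMP_NUM_THREADS` if it is set, otherwise the number of logical cores). The real runtime may report something different, for example when cpu affinity is restricted or `omp_set_num_threads()` has been called.

```python
import os


def expected_omp_threads(env=None):
    """Approximate what omp_get_max_threads() would report at startup."""
    env = os.environ if env is None else env
    raw = env.get("OMP_NUM_THREADS", "").strip()
    if raw:
        # The variable may hold a comma-separated list for nested parallelism;
        # the first entry applies to the outermost parallel region.
        first = raw.split(",")[0].strip()
        if first.isdigit() and int(first) > 0:
            return int(first)
    # Variable unset or unparsable: fall back to the logical core count.
    return os.cpu_count() or 1


print("expected openmp threads:", expected_omp_threads())
```

Printing this once per process makes it easy to compare the single-gpu and two-gpu runs.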

it looks like when using multi-processes (multi-gpus), multi-threading is not working…
any idea why (gil bottleneck???)? and how to fix this?
the whole point of using ddp is to speedup computations.


openmp threads are mostly for cpu computation. for GPU computation, you need to make sure the tensors and the computation are on GPUs; it will then launch CUDA kernels and compute in parallel.

but this still does not answer the question of why c++ multi-threading seems to be turned off when using ddp + multi-gpus.
i don't know whether the threads of each process block each other or whether something else is going on.

the rest of computation is done on gpu where all tensors live.
only compute_cpu() is done exclusively on cpu because it is simply a c++ cpu implementation. i don't have a gpu implementation yet.

@ptrblck any idea why this is happening?
this is the reason i was looking for a cuda extension in the other thread.

this is a cpu c++ code, and has nothing to do with the other thread.

basically, the function compute_on_cpu() calls the c++ function:

i call this function in pytorch after wrapping it in python using swig.
this c++ function creates a parallel region. i assume that each thread will deal with a sample in the minibatch in the loop.

this is actually fast (70ms).
when using ddp+2gpus, the multi-threading does not seem to work (the call takes 500ms == time when using only 1 thread).

the maximum threads on that machine is 48.
batch size 32.

any idea why?

thank you very much for your help!

I’m not entirely sure if you are seeing a warning or are setting the OMP threads too late, but here omp_num_threads will be set to 1.
Could you check if this could also be the root cause of your observation?

you are so right!!!
distributed turned off the multithreading, and the c++ function asked for the max number of threads too late, when it was already set to 1.

– begin side question
side question: how to ask ddp to log into a specific file?
the ddp logging goes to the terminal, which makes catching warnings difficult.
there are a lot of messages but they are lost.
is there a way to ask ddp to write logs in a specific file?
or can we properly manipulate the instance of log here to do that? thanks
the first time i used ddp, it started throwing logs in the terminal, which is impractical for debugging.
there should be a way to tell ddp to log into a file.
the doc1 and doc2 do not seem to cover this.
i didnt investigate further as other things have more priority.

–end side question
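For messages that go through python's own `logging` and `warnings` modules, one option is to attach a file handler in each process. This is a sketch under assumptions, not a ddp feature: `setup_rank_logging` is a hypothetical helper, the `ddp_logs` directory name is made up, and the `RANK` environment variable is assumed to be set by the launcher. Anything the launcher or native code writes directly to stdout/stderr would not be captured this way.

```python
import logging
import os
import tempfile
import warnings


def setup_rank_logging(log_dir):
    """Send this process's log records and Python warnings to a per-rank file."""
    rank = os.environ.get("RANK", "0")  # assumed to be set by the launcher
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, "rank_{}.log".format(rank))
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(handler)
    logging.captureWarnings(True)  # route warnings.warn() into the logging system
    return path


log_path = setup_rank_logging(os.path.join(tempfile.gettempdir(), "ddp_logs"))
logging.getLogger("train").info("logging configured")
warnings.warn("example warning captured in the log file")
```

Calling the helper at the top of the training script gives one file per rank, which is easier to search than interleaved terminal output.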

so, now i configure export OMP_NUM_THREADS=32 before running and the runtime is back to 70ms with ddp+2gpus!!! this is cool!
thank you very much! this is a life saver!
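The warning asks for the value to be tuned per process. One possible heuristic, shown here as a sketch and not as a recommendation from pytorch, is to split the cores evenly across the ddp processes and cap the result at the batch size. `threads_per_process` is a hypothetical helper, and the cap assumes one thread per sample as described above. The variable has to be in the environment before the openmp runtime initializes, so in a script this would go before importing the extension.

```python
import os


def threads_per_process(total_cores, nproc_per_node, batch_size=None):
    """Split the cores evenly across processes, optionally capped by batch size."""
    threads = max(1, total_cores // nproc_per_node)
    if batch_size is not None:
        # Assumption: one thread per sample is the most the loop can use.
        threads = min(threads, batch_size)
    return threads


n = threads_per_process(total_cores=48, nproc_per_node=2, batch_size=32)
# setdefault keeps any value already exported in the shell.
os.environ.setdefault("OMP_NUM_THREADS", str(n))
print("OMP_NUM_THREADS =", os.environ["OMP_NUM_THREADS"])
```

For 48 cores and 2 processes this heuristic gives 24 threads per process, which avoids asking for more threads in total than the machine has cores. Whether that beats 32 per process would need to be measured.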

also, i was getting this warning which is impossible to read because it is printed on terminal!

Setting OMP_NUM_THREADS environment variable for each process to 
be 1 in default, to avoid your system being overloaded, please further tune 
the variable for optimal performance in your application as needed.

which explains everything!

also, i am using distributed.launch, which explains why i am getting this warning

The module torch.distributed.launch is deprecated and going to be removed
 in future.Migrate to

in the examples they provided, they use launch. i should probably switch to run, as it seems more up to date.
also, they used launch in this under Launch utility. the doc probably needs to be updated in the next release. i am using pytorch 1.9.0.

again, thank you so much! this was very helpful!

The linked line of code sounds right and you could try to set the --log_dir as described in this argument.
Good to hear you’ve solved the threading issue.

thanks. that seems to be the way to tell run to log to a file.
i think i need to go through its docstring, which is well documented.
i expect this to be released in the next version because it does not exist in the current doc.

again, thank you very much for your help!!!

not urgent, but i tried logging into a file with ddp.
it worked partially.
something else is still logging into terminal and not file.
here in a separate thread.
this may be still under dev.