How is multi-threading managed within DDP?

@ptrblck any idea why this is happening?
this is the reason i was looking for a cuda extension in the other thread.

this is cpu-only c++ code, and has nothing to do with the other thread.

basically, the function compute_on_cpu() calls the c++ function:

i call this function in pytorch after wrapping it in python using swig.
this c++ function creates a parallel region. i assume that each thread will deal with one sample of the minibatch in the loop.

this is actually fast (70ms).
when using ddp + 2 gpus, the multi-threading does not seem to work: the call takes 500ms, which is the same time as when using only 1 thread.

the maximum number of threads on that machine is 48.
the batch size is 32.
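
to narrow this down, a small diagnostic could be run inside each ddp process just before the call to compute_on_cpu(). this is only a sketch under assumptions: it assumes the parallel region is openmp-based (so OMP_NUM_THREADS matters) and that the processes are started by a launcher such as torchrun, which, as far as i know, may set OMP_NUM_THREADS=1 per process when the variable is unset. the helper name thread_env_report is hypothetical and uses only the python standard library.

```python
import os


def thread_env_report():
    """Collect settings that commonly cap the number of threads in a process.

    Hypothetical helper for debugging; it does not change any setting.
    """
    report = {
        "pid": os.getpid(),
        # RANK / LOCAL_RANK are set by launchers such as torchrun; None if absent.
        "rank": os.environ.get("RANK"),
        "local_rank": os.environ.get("LOCAL_RANK"),
        # Assumption: the c++ parallel region is openmp-based and honours this.
        "omp_num_threads": os.environ.get("OMP_NUM_THREADS"),
        # Number of logical cpus on the machine (may be None if undetermined).
        "cpu_count": os.cpu_count(),
    }
    # sched_getaffinity only exists on some platforms (e.g. linux); it shows
    # how many cpus this particular process is allowed to run on.
    if hasattr(os, "sched_getaffinity"):
        report["usable_cpus"] = len(os.sched_getaffinity(0))
    else:
        report["usable_cpus"] = None
    return report


if __name__ == "__main__":
    print(thread_env_report())
```

if omp_num_threads shows up as "1" under ddp but is unset in the single-process run, that would be consistent with the 500ms timing; exporting OMP_NUM_THREADS explicitly before launching would then be the thing to try.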

any idea why?

thank you very much for your help!