@ptrblck any idea why this is happening?
This is the reason I was looking for a CUDA extension in the other thread.
This is CPU C++ code, though, and has nothing to do with the other thread.
Basically, the function compute_on_cpu() calls the C++ function:
I call this function in PyTorch after wrapping it in Python using SWIG.
This C++ function creates a parallel region. I assume each thread handles one sample of the minibatch in the loop.
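The actual snippet isn't shown above, so here is a minimal sketch of the structure I mean (function name, signature, and the per-sample work are all hypothetical), assuming an OpenMP parallel-for over the batch dimension:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical sketch: one loop iteration per sample in the minibatch.
// Built with -fopenmp, OpenMP distributes iterations across threads;
// without it, the pragma is ignored and the loop runs serially.
void compute_on_cpu_impl(const std::vector<std::vector<float>>& batch,
                         std::vector<float>& out) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(batch.size()); ++i) {
        float sum = 0.0f;               // placeholder per-sample work
        for (float v : batch[i]) sum += v;
        out[i] = sum;
    }
}
```

With batch size 32 and up to 48 threads, each sample should get its own thread in a single parallel region like this.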
This is actually fast (70 ms).
When using DDP + 2 GPUs, the multi-threading does not seem to work: the call takes 500 ms, the same time as when using only 1 thread.
The maximum number of threads on that machine is 48.
Batch size is 32.