@ptrblck any idea why this is happening?
This is the reason I was looking for a CUDA extension in the other thread.
This is CPU C++ code, though, and has nothing to do with the other thread.
Basically, the function compute_on_cpu() calls the C++ function:
I call this function in PyTorch after wrapping it in Python using SWIG.
This C++ function creates a parallel region. I assume each thread handles one sample of the minibatch in the loop.
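The actual snippet isn't shown above, so here is a minimal sketch of the structure I mean (function name, signature, and the per-sample work are all hypothetical), assuming an OpenMP parallel-for over the batch dimension:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical sketch: one loop iteration per sample in the minibatch.
// Built with -fopenmp, OpenMP distributes iterations across threads;
// without it, the pragma is ignored and the loop runs serially.
void compute_on_cpu_impl(const std::vector<std::vector<float>>& batch,
                         std::vector<float>& out) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(batch.size()); ++i) {
        float sum = 0.0f;               // placeholder per-sample work
        for (float v : batch[i]) sum += v;
        out[i] = sum;
    }
}
```

With batch size 32 and up to 48 threads, each sample should get its own thread in a single parallel region like this.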
This is actually fast (70 ms).
When using DDP + 2 GPUs, the multi-threading does not seem to work: the call takes 500 ms, the same time as when using only 1 thread.
The maximum number of threads on that machine is 48.
Batch size is 32.