NMS GPU time costs much more than in Caffe

Hi, has anyone compared the GPU time spent on NMS in Caffe and PyTorch? I recently transferred a model from Caffe to libtorch, and I found that most of the libtorch forward time is concentrated in the NMS operation: it takes 160ms, which is 80% of the whole forward time, while the overall time of the Caffe version is only 150ms!
The NMS implementation I used is copied from torchvision 0.3, and the one in maskrcnn-benchmark also takes around 160ms. Is there a way to optimize NMS further? Thank you.

How did you post this twice?

I thought that the PyTorch NMS was ultimately derived from the (caffe2) detectron, so it seems strange that it should be that much slower.
You would have to show your benchmarking code before I would try to find out why it is different.

Best regards

Thomas

Hi, Thomas. I would also like to post it, but unfortunately I am working on a libtorch implementation of PVANet, and NMS is only a small part of it, so it may not be possible to upload the entire project. I just used the C++ chrono library to record the time spent in the nms function and compared it with the Caffe version of PVANet. Sorry that I can't provide you with more information. :pray::pray:
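
The measurement itself was just a std::chrono wrapper around the call, roughly like this sketch (nms_cuda and dets are stand-ins, not the exact code):

// Rough sketch of the chrono-based measurement (placeholder names, not the real code):
#include <chrono>
#include <iostream>

auto t0 = std::chrono::high_resolution_clock::now();
auto keep = nms_cuda(dets, 0.7f);  // dets: N x 5 CUDA tensor (placeholder)
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << "nms took "
          << std::chrono::duration<double, std::milli>(t1 - t0).count()
          << " ms" << std::endl;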

Yeah, well, the usual way to do these things is to grab an input and try to measure the function in isolation.
Benchmarking has quite a few pitfalls, in particular with CUDA asynchronous computation involved, so it’s hard to say whether you found something where PyTorch is indeed terribly slow or whether you’ve just screwed up your benchmarking (I have certainly done that before, too).

Best regards

Thomas

OK, I understand. Thanks a lot!

Hi, Tom. This is the NMS code that I copied from torchvision 0.3.
I found that the nms_kernel function itself is very fast, costing only 0.01ms, but this data-movement operation

at::Tensor mask_cpu = mask.to(at::kCPU);

from GPU to CPU is very slow: it takes 222.432ms. Can I do something to reduce this time consumption?

// boxes is a N x 5 tensor
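// Note: threadsPerBlock and the nms_kernel CUDA kernel are defined earlier in
// torchvision's nms_cuda.cu (threadsPerBlock is sizeof(unsigned long long) * 8).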
at::Tensor nms_cuda(const at::Tensor boxes, float nms_overlap_thresh) {
  using scalar_t = float;
  AT_ASSERTM(boxes.type().is_cuda(), "boxes must be a CUDA tensor");
  at::cuda::CUDAGuard device_guard(boxes.device());

  auto scores = boxes.select(1, 4);
  auto order_t = std::get<1>(scores.sort(0, /* descending=*/true));
  auto boxes_sorted = boxes.index_select(0, order_t);

  int boxes_num = boxes.size(0);

  const int col_blocks = at::cuda::ATenCeilDiv(boxes_num, threadsPerBlock);

  at::Tensor mask =
      at::empty({boxes_num * col_blocks}, boxes.options().dtype(at::kLong));

  dim3 blocks(col_blocks, col_blocks);
  dim3 threads(threadsPerBlock);
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  AT_DISPATCH_FLOATING_TYPES_AND_HALF(
      boxes_sorted.type(), "nms_kernel_cuda", [&] {
        nms_kernel<scalar_t><<<blocks, threads, 0, stream>>>(
            boxes_num,
            nms_overlap_thresh,
            boxes_sorted.data<scalar_t>(),
            (unsigned long long*)mask.data<int64_t>());
      });

  at::Tensor mask_cpu = mask.to(at::kCPU);
  unsigned long long* mask_host = (unsigned long long*)mask_cpu.data<int64_t>();

  std::vector<unsigned long long> remv(col_blocks);
  memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);

  at::Tensor keep =
      at::empty({boxes_num}, boxes.options().dtype(at::kLong).device(at::kCPU));
  int64_t* keep_out = keep.data<int64_t>();

  int num_to_keep = 0;
  for (int i = 0; i < boxes_num; i++) {
    int nblock = i / threadsPerBlock;
    int inblock = i % threadsPerBlock;

    if (!(remv[nblock] & (1ULL << inblock))) {
      keep_out[num_to_keep++] = i;
      unsigned long long* p = mask_host + i * col_blocks;
      for (int j = nblock; j < col_blocks; j++) {
        remv[j] |= p[j];
      }
    }
  }

  AT_CUDA_CHECK(cudaGetLastError());
  return
      order_t
          .index({keep.narrow(/*dim=*/0, /*start=*/0, /*length=*/num_to_keep)
                      .to(order_t.device(), keep.scalar_type())});
}

Did you use torch.cuda.synchronize() before timing starts and ends?

Best regards

Thomas

How can I do that in libtorch?
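
A minimal sketch of one way to do it, assuming the CUDA runtime header is available: cudaDeviceSynchronize() plays the role of torch.cuda.synchronize(), called once before starting and once before stopping the timer (nms_cuda and dets are placeholders from the snippet above):

#include <chrono>
#include <cuda_runtime.h>
#include <iostream>

cudaDeviceSynchronize();  // finish all previously queued CUDA work first
auto t0 = std::chrono::high_resolution_clock::now();
auto keep = nms_cuda(dets, 0.7f);  // dets: N x 5 CUDA tensor (placeholder)
cudaDeviceSynchronize();  // wait for the NMS kernels themselves to finish
auto t1 = std::chrono::high_resolution_clock::now();
std::cout << "nms took "
          << std::chrono::duration<double, std::milli>(t1 - t0).count()
          << " ms" << std::endl;

Measured this way, time that was previously hiding in earlier asynchronous kernels should no longer show up in the NMS number.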