NMS GPU time much higher than Caffe

Hi, has anyone compared the GPU time spent on NMS in Caffe and PyTorch? I recently ported a model from Caffe to libtorch, and I found that most of the libtorch time is concentrated in the NMS operation: it takes 160ms, which is 80% of the entire forward time, while the whole Caffe forward pass takes only 150ms!
The NMS implementation I used is copied from torchvision 0.3, and maskrcnn-benchmark is also around 160ms. Is there any way to optimize NMS further? Thank you.

How did you post this twice?

I thought that the PyTorch NMS was ultimately derived from the (Caffe2) Detectron implementation, so it seems strange that it would be that much slower.
You would have to show your benchmarking code before I could try to find out why it differs.

Best regards


Hi Tom. I would like to post it, but unfortunately I am working on a libtorch implementation of PVANet, and NMS is only a small part of it, so it may not be possible to upload the entire project. I just used the C++ chrono library to record the time spent in the NMS function and compared it with the Caffe version of PVANet. Sorry that I can't provide you with more information. :pray::pray:

Yeah, well, the usual way to do these things is to grab an input and try to measure the function in isolation.
Benchmarking has quite a few pitfalls, in particular when asynchronous CUDA computation is involved, so it's hard to say whether you found something where PyTorch is indeed terribly slow or whether you've just screwed up your benchmarking (I have certainly done that before, too).
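For example, a sketch of such an isolated measurement could look like this (names are illustrative, not from your project; in a CUDA build `sync_device()` would call `cudaDeviceSynchronize()`, and the CPU-only fallback just keeps the sketch compilable anywhere):

```cpp
#include <chrono>

// Illustrative harness. In a CUDA build, sync_device() calls
// cudaDeviceSynchronize() so queued kernels cannot leak their runtime
// into (or out of) the timed region.
#ifdef __CUDACC__
#include <cuda_runtime.h>
static void sync_device() { cudaDeviceSynchronize(); }
#else
static void sync_device() {}  // no-op fallback for CPU-only builds
#endif

template <typename F>
double time_ms(F&& fn, int iters = 10) {
  fn();           // warm-up: exclude one-time allocation costs
  sync_device();  // drain pending asynchronous work before starting
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) fn();
  sync_device();  // wait until the measured work has really finished
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}
```

Measuring the suspect function alone this way (averaged over several iterations, with synchronization on both sides) usually tells you quickly whether the slowness is real or a timing artifact.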

Best regards


OK, I understand. Thanks a lot!

Hi Tom, this is the NMS code that I copied from torchvision 0.3.
I found that the nms_kernel function is very fast, it only costs 0.01ms, but this data-transfer operation

at::Tensor mask_cpu = mask.to(at::kCPU);

which copies the mask from GPU to CPU, is very slow: it takes 222.432ms. Can I do something to reduce this time?

// boxes is a N x 5 tensor
at::Tensor nms_cuda(const at::Tensor boxes, float nms_overlap_thresh) {
  using scalar_t = float;
  AT_ASSERTM(boxes.type().is_cuda(), "boxes must be a CUDA tensor");
  at::cuda::CUDAGuard device_guard(boxes.device());

  auto scores = boxes.select(1, 4);
  auto order_t = std::get<1>(scores.sort(0, /* descending=*/true));
  auto boxes_sorted = boxes.index_select(0, order_t);

  int boxes_num = boxes.size(0);

  const int col_blocks = at::cuda::ATenCeilDiv(boxes_num, threadsPerBlock);

  at::Tensor mask =
      at::empty({boxes_num * col_blocks}, boxes.options().dtype(at::kLong));

  dim3 blocks(col_blocks, col_blocks);
  dim3 threads(threadsPerBlock);
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  AT_DISPATCH_FLOATING_TYPES_AND_HALF(
      boxes_sorted.type(), "nms_kernel_cuda", [&] {
        nms_kernel<scalar_t><<<blocks, threads, 0, stream>>>(
            boxes_num,
            nms_overlap_thresh,
            boxes_sorted.data<scalar_t>(),
            (unsigned long long*)mask.data<int64_t>());
      });

  at::Tensor mask_cpu = mask.to(at::kCPU);
  unsigned long long* mask_host = (unsigned long long*)mask_cpu.data<int64_t>();

  std::vector<unsigned long long> remv(col_blocks);
  memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);

  at::Tensor keep =
      at::empty({boxes_num}, boxes.options().dtype(at::kLong).device(at::kCPU));
  int64_t* keep_out = keep.data<int64_t>();

  int num_to_keep = 0;
  for (int i = 0; i < boxes_num; i++) {
    int nblock = i / threadsPerBlock;
    int inblock = i % threadsPerBlock;

    if (!(remv[nblock] & (1ULL << inblock))) {
      keep_out[num_to_keep++] = i;
      unsigned long long* p = mask_host + i * col_blocks;
      for (int j = nblock; j < col_blocks; j++) {
        remv[j] |= p[j];
      }
    }
  }

  return order_t
      .index({keep.narrow(/*dim=*/0, /*start=*/0, /*length=*/num_to_keep)
                  .to(order_t.device(), keep.scalar_type())});
}

Did you use torch.cuda.synchronize() before timing starts and ends?

Best regards


How can I do that in libtorch?
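In C++, the usual counterpart of Python's torch.cuda.synchronize() is the CUDA runtime call cudaDeviceSynchronize() (or at::cuda::getCurrentCUDAStream().synchronize() to wait on just the current stream). A minimal sketch of the timing pattern, with a CPU-only fallback so it also compiles without CUDA (in a real libtorch translation unit you would keep only the CUDA branch; nms_cuda in the commented usage is the function from the thread above):

```cpp
#include <chrono>

// cudaDeviceSynchronize() blocks the host thread until all queued GPU
// work has finished, which is what torch.cuda.synchronize() does.
#ifdef __CUDACC__
#include <cuda_runtime.h>
static void gpu_sync() { cudaDeviceSynchronize(); }
#else
static void gpu_sync() {}  // fallback so this sketch builds CPU-only
#endif

// Hypothetical usage, bracketing the call under test:
//   gpu_sync();                              // flush pending async work
//   auto t0 = std::chrono::steady_clock::now();
//   auto keep = nms_cuda(boxes, 0.7f);
//   gpu_sync();                              // wait for the kernel to finish
//   auto t1 = std::chrono::steady_clock::now();
double elapsed_ms(std::chrono::steady_clock::time_point t0,
                  std::chrono::steady_clock::time_point t1) {
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Without the second gpu_sync(), the clock stops while the kernel may still be running, and its cost then shows up in whatever synchronizes next, such as the mask.to(at::kCPU) copy.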