Converting pre-PyTorch 1.0 code

chinmay5 · June 25, 2020, 7:35am

I have code written for PyTorch 0.4 that I want to upgrade. The main part where I am stuck is the CUDA kernels. I know that I need to use the ATen library, but I could not find thorough documentation for it, so I am completely stuck on the upgrade. This is just one of the files that I intend to change, but feedback on it would serve as a prototype for the others. I would really appreciate it if someone could tell me what changes need to be made.

#include <THC/THC.h>
#include <TH/TH.h>
#include <math.h>
#include <stdio.h>

#include "cuda/nms_kernel.h"

extern THCState *state;

int gpu_nms(THLongTensor* keep, THLongTensor* num_out, THCudaTensor* boxes, float nms_overlap_thresh)
{
    THArgCheck(THLongTensor_isContiguous(keep), 0, "keep must be contiguous");
    THArgCheck(THCudaTensor_isContiguous(state, boxes), 2, "boxes must be contiguous");

    //Number of ROIs
    int boxes_num = THCudaTensor_size(state, boxes, 0);
    int boxes_dim = THCudaTensor_size(state, boxes, 1);

    float* boxes_flat = THCudaTensor_data(state, boxes);

    // round up so that a partially filled block still counts as one block
    const int col_blocks = DIVUP(boxes_num, threadsPerBlock);
    THCudaLongTensor * mask = THCudaLongTensor_newWithSize2d(state, boxes_num, col_blocks);
    unsigned long long* mask_flat = THCudaLongTensor_data(state, mask);
    //printf("boxes_num: %d", boxes_num);

    _nms(boxes_num, boxes_flat, mask_flat, nms_overlap_thresh);
    // mask_flat has shape [boxes_num, col_blocks], where col_blocks is the number of blocks;
    // each entry is a 64-bit word in which a set bit marks an overlap above the threshold

    THLongTensor * mask_cpu = THLongTensor_newWithSize2d(boxes_num, col_blocks);
    THLongTensor_copyCuda(state, mask_cpu, mask);
    THCudaLongTensor_free(state, mask);

    unsigned long long * mask_cpu_flat = THLongTensor_data(mask_cpu);

    THLongTensor * remv_cpu = THLongTensor_newWithSize1d(col_blocks);
    unsigned long long* remv_cpu_flat = THLongTensor_data(remv_cpu);
    THLongTensor_fill(remv_cpu, 0);

    long * keep_flat = THLongTensor_data(keep);
    long num_to_keep = 0;

    for (int i = 0; i < boxes_num; i++)
    {
        int nblock = i / threadsPerBlock;
        int inblock = i % threadsPerBlock;

        // if no previously kept box overlaps box i
        if(!(remv_cpu_flat[nblock] & (1ULL << inblock)))
        {
            keep_flat[num_to_keep++] = i;
            unsigned long long *p = &mask_cpu_flat[0] + i * col_blocks;
            for (int j = nblock; j < col_blocks; j++)
             {
                remv_cpu_flat[j] |= p[j];
             }
        }
    }

    long * num_out_flat = THLongTensor_data(num_out);
    * num_out_flat = num_to_keep;

    THLongTensor_free(mask_cpu);
    THLongTensor_free(remv_cpu);
    return 1;
}
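
For readers less familiar with the packed mask, the host-side loop above can be sketched in plain Python. This is only an illustrative reading of the C code, not a replacement for it. The function name `reduce_nms_mask`, the list-of-lists layout of `mask`, and the default of 64 bits per word are assumptions made for the sketch; it also assumes the boxes are already sorted by descending score.

```python
def reduce_nms_mask(mask, num_boxes, threads_per_block=64):
    """Turn a packed suppression mask into the list of kept box indices.

    Assumed layout (hypothetical, mirroring the comments in the C code):
    mask[i][j] is an integer whose bit b is set when box i overlaps box
    j * threads_per_block + b by more than the threshold.
    """
    # Same rounding-up as DIVUP in the C code.
    col_blocks = (num_boxes + threads_per_block - 1) // threads_per_block
    removed = [0] * col_blocks  # one bit per box, packed into words
    keep = []
    for i in range(num_boxes):
        nblock, inblock = divmod(i, threads_per_block)
        # Box i survives only if no previously kept box has flagged it.
        if not (removed[nblock] >> inblock) & 1:
            keep.append(i)
            # Everything that box i overlaps is now marked as removed.
            for j in range(nblock, col_blocks):
                removed[j] |= mask[i][j]
    return keep


# Box 0 overlaps box 1, box 1 overlaps box 2. Box 1 is suppressed by box 0,
# so it never gets the chance to suppress box 2.
example_mask = [[0b0010], [0b0100], [0], [0]]
kept = reduce_nms_mask(example_mask, num_boxes=4)  # [0, 2, 3]
```

A small reference like this may be handy for checking the output of a ported kernel on a handful of boxes.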

tom · June 25, 2020, 8:06am

The API has changed considerably. (By the way, this looks like pre-0.4 code: it resembles the C backend of Torch7, which was the starting point for PyTorch, whereas the current API is much more modern.)
Basically, the TH*Tensor types all become Tensor, TORCH_CHECK with arguments replaces THArgCheck, and functions of the form TH*Tensor_foo(input, ...) typically become methods input.foo(...).
But you’ll have to fill in the details yourself or use the pre-made NMS from TorchVision.

Best regards

Thomas

chinmay5 · June 25, 2020, 8:16am

Hi @tom, thank you so much for your reply. I cannot use NMS from torchvision since I am working on 3D data and not 2D images. However, if you could still direct me to the actual source code of NMS, maybe I can draw some inspiration from it. My C++ skills are very basic, so I do need a lot of help in completing this task.

Thanks again,
Chinmay

tom · June 25, 2020, 8:29am

The CUDA part is in csrc/cuda/nms_cuda.cu; there is a CPU variant in a sibling directory and a bit of common glue code in nms.h, one directory up.

Best regards

Thomas

chinmay5 · June 25, 2020, 8:46am

Hi @tom, the link seems to be broken; could you please check? Apart from that, if I write the same thing in pure PyTorch and put the individual tensors on CUDA, what would be your estimate of the speed difference? For instance, I have this NumPy version, which I could port to PyTorch and check:

    while order.size > 0:
        i = order[0]  # pick the top remaining box
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        zz1 = np.maximum(z1[i], z1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        zz2 = np.minimum(z2[i], z2[order[1:]])

        w = np.maximum(0.0, xx2 - xx1 + 1)  # overlap width, clamped at 0
        h = np.maximum(0.0, yy2 - yy1 + 1)  # overlap height, clamped at 0
        l = np.maximum(0.0, zz2 - zz1 + 1)  # overlap length, clamped at 0
        inter = w * h * l
        ovr = inter / (volume[i] + volume[order[1:]] - inter)

        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]
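
The loop above relies on arrays (`x1` ... `z2`, `volume`, `order`) that are set up elsewhere. A self-contained, dependency-free sketch of the same greedy procedure is given here as a slow reference to compare a PyTorch or CUDA port against. The names `box_volume`, `iou_3d` and `nms_3d`, the `(x1, y1, z1, x2, y2, z2)` tuple layout, and the assumption that `volume` uses the same `+ 1` convention as the overlap are all choices made for this sketch, not taken from the original code.

```python
def box_volume(box):
    """Volume of an axis-aligned box (x1, y1, z1, x2, y2, z2), with the +1 convention."""
    x1, y1, z1, x2, y2, z2 = box
    return (x2 - x1 + 1) * (y2 - y1 + 1) * (z2 - z1 + 1)


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    w = max(0.0, min(a[3], b[3]) - max(a[0], b[0]) + 1)  # overlap along x
    h = max(0.0, min(a[4], b[4]) - max(a[1], b[1]) + 1)  # overlap along y
    l = max(0.0, min(a[5], b[5]) - max(a[2], b[2]) + 1)  # overlap along z
    inter = w * h * l
    return inter / (box_volume(a) + box_volume(b) - inter)


def nms_3d(boxes, scores, thresh):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        i = order[0]
        keep.append(i)
        # Keep only candidates whose overlap with box i is at most thresh.
        order = [j for j in order[1:] if iou_3d(boxes[i], boxes[j]) <= thresh]
    return keep


boxes = [(0, 0, 0, 9, 9, 9), (1, 1, 1, 10, 10, 10), (20, 20, 20, 29, 29, 29)]
scores = [0.9, 0.8, 0.7]
kept = nms_3d(boxes, scores, thresh=0.5)  # [0, 2]: box 1 overlaps box 0 too much
```

Being quadratic and written in pure Python, this is only meant for checking correctness on small inputs, not for measuring speed.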

Thanks again for your support

tom · June 27, 2020, 11:43am

So you have the IoU part up to ovr, which is computed “item by item”; this can typically be sped up by combining it into a single kernel. This type of operation is typically memory-bandwidth bound, so the speedup factor is roughly equal to the number of kernel calls (function calls) you had before. You could adapt that into returning a boolean mask. This could, however, also be fused automatically by the JIT, so it might not be worth doing C++ coding for that if that’s not your favorite thing to do.
I have a notebook on this (using IoU as an example) in the JIT part of my PyTorch, JIT, Android talk from December 2018, a classic if you want.
The thresholding (which I’d write with boolean indexing on the result returned from the JITed function rather than with where/nonzero) is also largely done in the TorchVision kernel. Doing this externally adds some overhead, but in the overall scheme of the 3D convolutions that likely come before, it might not be that significant.

Best regards

Thomas

chinmay5 · June 27, 2020, 12:41pm

Hi @tom

Thank you so much for your response. Based on your ideas, I am trying to change my code:

@torch.jit.script
def nms_pytorch2(dets, thresh):
    overlap = bbox_overlap(dets, dets)

    treshold_matrix = torch.tril((overlap > thresh), diagonal=-1)

    # Tensor elements indicate whether box should be kept
    is_maximum = treshold_matrix.new_ones(dets.shape[0])

    # loop over all boxes with highest confidence in the scene
    # Apply this vectorized over all boxes in the batch.
    for box in treshold_matrix.unbind(-1):
        # Disable all other boxes in the same scene if the current box is not
        # disabled.
        is_maximum = is_maximum & ~box

        # Also disable the overlaps of boxes which getting disabled right now.
        treshold_matrix &= ~box.unsqueeze(-2)

    return is_maximum

But I get this error:

torch.jit.frontend.NotSupportedError: unsupported kind of augumented assignment: BitAnd:
         # Also disable the overlaps of boxes which getting disabled right now.
        treshold_matrix &= ~box.unsqueeze(-2)
                        ~~ <--- HERE
    return is_maximum

Could you please help me here?

Thanks,
Chinmay

tom · June 27, 2020, 7:23pm

Just make that non-inplace.

chinmay5 · June 27, 2020, 7:24pm

Hi @tom
Yes, I just figured it out. However, when I compare its speed against the CUDA version on a sample of 1000 random bounding boxes, my version is still 20 times slower. I tried to vectorize it as much as possible, so I am not sure why it is still so slow…

Thanks,
Chinmay

tom · June 27, 2020, 10:52pm

Yeah. If that bothers you in the overall picture, probably go with a custom kernel.

Best regards

Thomas

Looottch · October 13, 2020, 9:07pm

What about calls such as “THCAssertSameGPU()” and “THError()”? Should I replace them if I need to convert a PyTorch 0.4-based C file into a PyTorch 1.6-based C++ file?