Optimizing Position-Sensitive ROI Pooling (R-FCN / DP-FCN)

I am implementing my own version of DP-FCN, a deformable-parts version of R-FCN:

- [R-FCN](https://arxiv.org/abs/1605.06409)
- [DP-FCN](https://arxiv.org/abs/1707.06175)

As part of position-sensitive ROI pooling, I take in a feature map of shape
(38, 38, 7, 7, 21), where 38 × 38 is the spatial extent, 7 × 7 is the grid size, and 21 is the number of classes.

I then need to project each region proposal onto the spatial dimensions of the feature map and perform position-sensitive pooling: for each grid entry and each class, I average-pool the corresponding grid location (trying a set of displacements, as in deformable-parts architectures). ROIs are in the form (x_min, y_min, x_max, y_max).
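
For concreteness, the projection step is just integer division by the backbone stride. A tiny standalone example with hypothetical numbers (a 608 × 608 input, so the stride onto the 38 × 38 map would be 16):

import torch

roi = torch.tensor([64.0, 32.0, 191.0, 159.0])  # (x_min, y_min, x_max, y_max) in pixels
stride = 16  # hypothetical: 608 / 38
projected = torch.floor_divide(roi, stride).type(torch.int16)
print(projected)  # tensor([ 4,  2, 11,  9], dtype=torch.int16)

Here is my current implementation: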

# All of the following are methods of my model class (class wrapper omitted).
def computeROISensitivePooling(self, feature_map, rois, stride=38):
    rois = torch.floor_divide(rois, stride)  # project the ROIs onto the feature map
    rois = rois.type(torch.int16)  # convert float bounding-box coords to integer indices
    grid_scores = self.getGridScores(rois, feature_map)
    return grid_scores

# Get the (inclusive) area to pool over for an ROI in grid element (i, j)
def getBinRange(self, roi, i, j):
    roi_width = roi[2] - roi[0] + 1
    roi_height = roi[3] - roi[1] + 1
    x_min = torch.floor(i * torch.true_divide(roi_width, self.k)).type(torch.int16)
    x_max = torch.ceil((i + 1) * torch.true_divide(roi_width, self.k) - 1).type(torch.int16)
    y_min = torch.floor(j * torch.true_divide(roi_height, self.k)).type(torch.int16)
    y_max = torch.ceil((j + 1) * torch.true_divide(roi_height, self.k) - 1).type(torch.int16)
    return x_min, x_max, y_min, y_max  # all bounds are inclusive
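
As a sanity check on the rounding, here is the same formula evaluated standalone for a hypothetical projected ROI that is 10 cells wide with k = 7; adjacent bins can share boundary cells, which follows from the inclusive floor/ceil bounds:

import math

k, roi_width = 7, 10  # hypothetical: projected ROI 10 cells wide, 7 x 7 grid
for i in range(k):
    x_min = math.floor(i * roi_width / k)
    x_max = math.ceil((i + 1) * roi_width / k - 1)
    print(i, x_min, x_max)
# bins: (0,1) (1,2) (2,4) (4,5) (5,7) (7,8) (8,9)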

def computeROIHeatMapScore(self, score_map, i, j, roi, class_num, bin_range, dx=0, dy=0):
    x_min = roi[0] + bin_range[0]
    y_min = roi[1] + bin_range[2]
    x_max = roi[0] + bin_range[1]
    y_max = roi[1] + bin_range[3]
    bin_width = x_max - x_min + 1
    bin_height = y_max - y_min + 1
    ROI_width = roi[2] - roi[0] + 1
    ROI_height = roi[3] - roi[1] + 1
    # average-pool the displaced bin from the (i, j, class_num) slice
    heat_map_score = torch.sum(score_map[x_min + dx:x_max + dx + 1,
                                         y_min + dy:y_max + dy + 1,
                                         i, j, class_num])
    heat_map_score /= float(bin_width * bin_height)
    heat_map_score -= self.computeDeformationCost(dx, dy, ROI_width, ROI_height)
    return (dx, dy), heat_map_score


def computeDeformationCost(self, dx, dy, ROI_width, ROI_height):
    return self.regularization_parameter * ((dx ** 2 / ROI_width) + (dy ** 2 / ROI_height))
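
For example, with regularization_parameter = 0.1 (purely illustrative) and a projected ROI of 20 × 14 cells, displacing a part by (dx, dy) = (2, 1) costs 0.1 * (2^2/20 + 1^2/14) ≈ 0.027: the penalty grows quadratically with the displacement but is softened for larger ROIs.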


def computeOptimalDisplacementAndHeatMapScore(self, score_map, i, j, roi, class_num,
                                              feature_map_width=38, feature_map_height=38):
    bin_x_min, bin_x_max, bin_y_min, bin_y_max = self.getBinRange(roi, i, j)
    bin_range = [bin_x_min, bin_x_max, bin_y_min, bin_y_max]
    heat_map_scores = torch.empty(5 * 5)  # 5 x-displacements times 5 y-displacements
    displacements = []
    counter = 0
    for dx in range(5):
        for dy in range(5):
            # bounds check in absolute feature-map coordinates
            if (roi[0] + bin_x_max + dx <= feature_map_width - 1
                    and roi[1] + bin_y_max + dy <= feature_map_height - 1):
                displacement, heat_map_score = self.computeROIHeatMapScore(
                    score_map, i, j, roi, class_num, bin_range, dx, dy)
                displacements.append(displacement)
                heat_map_scores[counter] = heat_map_score
                counter += 1
    # argmax only over the entries that were actually filled in
    max_index = torch.argmax(heat_map_scores[:counter])
    return displacements[max_index], heat_map_scores[max_index]
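
One direction I have been considering (not yet part of my code) is to precompute a summed-area table (integral image) over the two spatial dimensions once per image, so that every bin/displacement sum becomes four lookups instead of a full torch.sum over a slice. A minimal sketch, assuming the same (38, 38, 7, 7, 21) layout; build_sat and bin_sum are names I made up:

import torch
import torch.nn.functional as F

def build_sat(score_map):
    # score_map: (H, W, k, k, C). Prefix-sum over the two spatial dims,
    # then prepend a zero row/column so windows starting at 0 need no special case.
    sat = score_map.cumsum(dim=0).cumsum(dim=1)
    sat = sat.permute(2, 3, 4, 0, 1)          # (k, k, C, H, W)
    sat = F.pad(sat, (1, 0, 1, 0))            # zero-pad H and W on the left
    return sat.permute(3, 4, 0, 1, 2)         # (H+1, W+1, k, k, C)

def bin_sum(sat, x0, x1, y0, y1, i, j, c):
    # sum of score_map[x0:x1+1, y0:y1+1, i, j, c] by inclusion-exclusion
    return (sat[x1 + 1, y1 + 1, i, j, c]
            - sat[x0, y1 + 1, i, j, c]
            - sat[x1 + 1, y0, i, j, c]
            + sat[x0, y0, i, j, c])

# quick self-check against the direct sum
sm = torch.rand(38, 38, 7, 7, 21)
sat = build_sat(sm)
assert torch.allclose(sm[3:10, 5:12, 2, 4, 8].sum(),
                      bin_sum(sat, 3, 9, 5, 11, 2, 4, 8), atol=1e-3)

Since the table depends only on the feature map, it could be built once per image and shared across all 64 proposals, 49 bins, 21 classes, and 25 displacements, turning the inner loops into pure indexing. I have not verified this end to end, so I may be missing something.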

The issue I am having is that my code takes up to 8 seconds for a single region proposal, and since the original authors' batch size is 1 image with 64 region proposals, this is clearly far too slow. Are there any obvious optimizations that would speed up my code? I am happy to provide any additional info or clarifications, as this is very important to me.
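
In case the measurement matters: I am timing it roughly like this (model, feature_map, and rois are placeholders for my module and inputs; the synchronize calls only matter on GPU):

import time
import torch

if torch.cuda.is_available():
    torch.cuda.synchronize()  # don't bill pending GPU work to the pooling
start = time.perf_counter()
scores = model.computeROISensitivePooling(feature_map, rois)  # rois holds one proposal
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for the pooling kernels to finish
print(f"{time.perf_counter() - start:.2f} s per proposal")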