Sync and batch size with multiple GPUs

In a ResNet, I want to crop the feature map (10x10, from larger input images) before the avgpool and fc layers, and the crop boxes (with a fixed size, say 5x5) have different locations for each input image. The following code works on a single GPU.

import torch
import torch.nn as nn

class ResNet(nn.Module):
    def __init__(self, parameters):
        super(ResNet, self).__init__()
        # conv1, layer1-layer4, avgpool and fc are defined as in the standard ResNet (omitted here)

    def forward(self, x, crop_info):
        x = self.conv1(x)
        # ... bn1, relu, maxpool, layer1-layer3 omitted ...
        x = self.layer4(x)

        # crop a fixed 5x5 box from each sample's feature map,
        # at a per-sample location given by crop_info
        bs, ch, hgh, wid = x.size()
        # allocate the output on the same device/dtype as the feature map
        crop_feat_map = torch.zeros(bs, ch, 5, 5, device=x.device, dtype=x.dtype)
        # len(crop_info) == bs; crop_info[itr] holds (x0, y0) of the crop box
        for itr in range(bs):
            x0 = crop_info[itr, 0]
            y0 = crop_info[itr, 1]
            crop_feat_map[itr] = x[itr, :, y0:y0 + 5, x0:x0 + 5]

        x = self.avgpool(crop_feat_map)
        x = x.reshape(x.size(0), -1)
        x = self.fc(x)
        return x

When I use it with torch.nn.parallel.DistributedDataParallel, it is quite slow. I am not sure this is a proper way to crop the feature map with multiple GPUs, since the batch size is kind of a global variable, and the GPUs do not share information during forward().
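For context, this is roughly the kind of setup I mean (a simplified sketch, not my exact code; train_dataset and the launch details are placeholders):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = ResNet(parameters).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# each process loads its own shard of the data
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=4, sampler=sampler)

for images, crop_info in loader:
    out = model(images.cuda(local_rank), crop_info.cuda(local_rank))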

For example, with two GPUs and bs=8 under DDP, each GPU gets 4 input images. What would the value of bs (from bs, ch, hgh, wid = x.size()) be on the first GPU? If it is 4, everything is fine. If it is 8, how could a single GPU get the whole batch without extra sync code?
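To check this, I suppose I could print the per-rank size at the top of forward(), something like the lines below (assuming the process group is already initialized):

import torch.distributed as dist

# first lines of forward(), just for debugging
if dist.is_available() and dist.is_initialized():
    # x.size(0) is whatever this single process/GPU actually sees
    print(f"rank {dist.get_rank()}: x.size(0) = {x.size(0)}")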

If the bs part of the code is correct, how could I speed it up?
If the bs part of the code is wrong, what is the right way to do it?
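For the speed-up part, would replacing the Python loop with one vectorized indexing op be the right direction? A sketch of what I have in mind (assuming crop_info is a LongTensor of (x0, y0) pairs on the same device as x, and that every box stays inside the feature map):

# build per-sample index grids and gather all 5x5 windows at once
bs, ch, hgh, wid = x.size()
ar5 = torch.arange(5, device=x.device)
ys = crop_info[:, 1].reshape(bs, 1, 1) + ar5.reshape(1, 5, 1)       # (bs, 5, 1) row indices
xs = crop_info[:, 0].reshape(bs, 1, 1) + ar5.reshape(1, 1, 5)       # (bs, 1, 5) col indices
batch_idx = torch.arange(bs, device=x.device).reshape(bs, 1, 1)     # (bs, 1, 1)

# advanced indexing broadcasts to (bs, 5, 5, ch); move channels back to dim 1
crop_feat_map = x[batch_idx, :, ys, xs].permute(0, 3, 1, 2).contiguous()

As far as I can tell this should give the same result as the loop, just without bs separate slicing ops.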