Efficiently process gigantic batch size and small spatial dimension

I tried to build a Fast R-CNN style pipeline using ResNets. The output of RoI pooling would be something like [4 x 1000 x 256 x 14 x 14] (where 4 is the batch size, 1000 the number of RoIs, 256 the number of feature maps, and 14 x 14 the spatial dimensions).

Putting this through the last ResNet block is very memory-hungry and slow. Is there a way around this? Any tiling scheme? The only workaround I can think of is to fold the batch and RoI dimensions together and push the features through the last block in chunks; a minimal sketch of what I mean is below (the names forward_in_chunks and last_block and the chunk size of 512 are just placeholders):
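
```python
import torch
import torch.nn as nn

def forward_in_chunks(last_block: nn.Module, roi_feats: torch.Tensor,
                      chunk_size: int = 512) -> torch.Tensor:
    # roi_feats: [B, R, C, H, W]; fold batch and RoI dims into one axis
    b, r, c, h, w = roi_feats.shape
    flat = roi_feats.reshape(b * r, c, h, w)
    # Run the final ResNet stage over manageable slices to cap peak memory.
    # Note: during training, activations from all chunks are still kept for
    # backward; combining this with torch.utils.checkpoint would trade
    # compute for memory.
    outs = [last_block(chunk) for chunk in flat.split(chunk_size, dim=0)]
    out = torch.cat(outs, dim=0)
    # Restore the [B, R, ...] layout
    return out.reshape(b, r, *out.shape[1:])
```
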

Thanks!

Try torch.backends.cudnn.benchmark = True, maybe?
Generally, I'd presume the default algorithm selection is not very well tuned for small batch sizes like the ones you'd get from tiling.
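
For reference, the flag is set like this; it only pays off when the input shapes stay fixed across iterations:

```python
import torch

# Ask cuDNN to benchmark the available convolution algorithms for each
# input shape it sees and cache the fastest one. Worth it only when shapes
# are constant across iterations; otherwise re-benchmarking dominates.
torch.backends.cudnn.benchmark = True
```
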