Does anyone know how Caffe's convolution implementation differs from PyTorch's?

I am currently getting slightly different results from the Caffe and PyTorch implementations of the same network. The outputs agree only to about 4-5 decimal places. Because my network is a cascade of a few deep nets, the difference slowly adds up and causes a difference of a few pixels in the final output for some of the images I have tested. Such cases are very rare, occurring in about 1 in 20,000 images. Otherwise, the final outputs agree to 5 decimal places.

Hence, I am wondering whether anyone has the experience to know if it is possible to make the outputs of these two frameworks agree to maybe 10 decimal places, which would make a difference in the final output unlikely.

I have unit tested both convolutions and they are indeed different in the last few decimal places.
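
A comparison of this kind could be sketched roughly as below, in plain Python. The helper names `max_abs_diff` and `matching_decimals` are hypothetical, and the values in `caffe_out` and `torch_out` are made-up placeholders standing in for the flattened output of the same layer from each framework, not real measurements.

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))

def matching_decimals(a, b):
    """Number of decimal places to which every pair of values agrees (capped at 15)."""
    diff = max_abs_diff(a, b)
    if diff == 0.0:
        return 15  # identical within double precision
    return max(0, int(math.floor(-math.log10(diff))))

# Placeholder values (assumption): the same layer's output from each framework.
caffe_out = [0.1234567, -1.0000012, 3.1415926]
torch_out = [0.1234571, -1.0000031, 3.1415925]

print("max abs diff:", max_abs_diff(caffe_out, torch_out))
print("agree to decimals:", matching_decimals(caffe_out, torch_out))
```

Running the same check after each stage of the cascade would show where the agreement starts to drop.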

Hopefully someone knows the implementation difference so that I can implement the Caffe version in PyTorch myself.