Comparing BN output (FW and BW) on CPU and on GPU

Hey, I am writing my own implementation of BN. When I compare it against the PyTorch implementation on CPU I get identical results, but when I set the device to GPU, the FW passes are still aligned while the BW passes are not. What could be the reason? I also run a gradient check and it passes, so I don't understand why the results diverge when moving from CPU to GPU. Any ideas? Thanks!


How large is the difference between the two implementations?

Apologies, my bad. I did not realize that calling .clone() on the input tensor I used to feed the 2 branches (PyTorch BN and mine) keeps both branches attached to the same autograd graph, so their gradients accumulate on the same input. After creating 2 independent inputs, I see that my BW pass does have issues and the supposed alignment is no longer there. Thanks for the reply in any case :slight_smile:
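For anyone hitting the same pitfall, here is a minimal sketch of what went wrong. It feeds the same PyTorch BN module through both branches purely for illustration (the custom BN from the thread is not shown); the point is that `.clone()` is differentiable, so backpropagating through two branches that clone the same leaf tensor sums their gradients into one `.grad`, which makes any comparison on that gradient misleading. Two independent leaves fix it:

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(3)
w = torch.randn(4, 3)  # fixed random weights so the loss gradient w.r.t. the input is non-trivial

# Misleading setup: one leaf tensor feeds both branches via .clone().
x = torch.randn(4, 3, requires_grad=True)
(bn(x.clone()) * w).sum().backward()   # branch 1 (e.g. PyTorch BN)
g1 = x.grad.clone()
(bn(x.clone()) * w).sum().backward()   # branch 2 (e.g. the custom BN)
# x.grad now holds the SUM of both branches' gradients, not either one alone:
print(torch.allclose(x.grad, 2 * g1))  # True

# Correct setup: two independent leaves holding the same values.
a = torch.randn(4, 3, requires_grad=True)
b = a.detach().clone().requires_grad_(True)  # same data, separate graph
(bn(a) * w).sum().backward()
(bn(b) * w).sum().backward()
# a.grad and b.grad are now isolated, so comparing them is meaningful.
print(torch.allclose(a.grad, b.grad))  # True here only because both branches use the same module
```

The `a.detach().clone().requires_grad_(True)` pattern is the usual way to get a second leaf with identical values but a fully independent autograd history.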