Turns out this is related to the thread "Performance highly degraded when eval() is activated in the test phase".
According to the people in that thread, it's a bug in PyTorch's definition of batchnorm: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/batchnorm.py
Their solution only partially solved my discrepancy.
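
For anyone landing here, this is a rough sketch of why batchnorm can give different outputs in train mode and after eval() in general. It is a simplified pure-Python model, not PyTorch's actual code and not the specific bug from the linked thread. The helper names `batchnorm_train` and `batchnorm_eval` are made up for illustration, and I'm assuming the usual defaults (momentum 0.1, running mean starting at 0, running variance starting at 1).

```python
import statistics

def batchnorm_train(batch, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Train mode: normalize with the batch's own statistics,
    then nudge the running estimates toward them."""
    mean = statistics.fmean(batch)
    var_biased = statistics.pvariance(batch, mu=mean)     # used to normalize
    var_unbiased = statistics.variance(batch, xbar=mean)  # used for running estimate
    out = [(x - mean) / (var_biased + eps) ** 0.5 for x in batch]
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * var_unbiased
    return out, running_mean, running_var

def batchnorm_eval(batch, running_mean, running_var, eps=1e-5):
    """Eval mode: normalize with the stored running statistics only."""
    return [(x - running_mean) / (running_var + eps) ** 0.5 for x in batch]

# Hypothetical toy data: one feature, batch of four values.
batch = [10.0, 12.0, 14.0, 16.0]
running_mean, running_var = 0.0, 1.0  # assumed initial values

# Only a few training steps, so the running estimates are still
# far from the real batch statistics.
for _ in range(3):
    train_out, running_mean, running_var = batchnorm_train(
        batch, running_mean, running_var
    )

eval_out = batchnorm_eval(batch, running_mean, running_var)
max_gap = max(abs(t - e) for t, e in zip(train_out, eval_out))

print("train output:", train_out)
print("eval output: ", eval_out)
print("largest train/eval gap:", max_gap)
```

If the running estimates have not converged (few iterations, small batches, or test data distributed differently from the training data), the eval output can be far from the train output for the same input, which is one common source of this kind of discrepancy.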