Hi,
My PyTorch project shows different behavior in PyTorch v0.1.12 and v0.4.0 (discussed here: https://github.com/xingyizhou/pytorch-pose-hg-3d/issues/16 and https://github.com/bearpaw/pytorch-pose/issues/33, just for reference; there is no need to look into the project for this topic), and I have firm reason to believe the problem lies in the BN layer. For debugging purposes, is it possible for me to locally switch the BN implementation in PyTorch v0.4.0 back to the v0.1.12 version? Or can anyone tell me, or point me to the files containing, the exact changes to the BN layer from v0.1.12 to v0.2.0 and later? Thanks very much!
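For concreteness, what I have in mind is something like monkey-patching the functional interface with my own Python implementation, roughly as in this sketch (names taken from the 0.4.0 API; the logging is just a placeholder for a reimplementation):

```python
import torch
import torch.nn.functional as F

_orig_batch_norm = F.batch_norm

def debug_batch_norm(input, running_mean, running_var, weight=None,
                     bias=None, training=False, momentum=0.1, eps=1e-5):
    # Replacement point: reimplement the old BN behavior here, or just
    # log statistics before delegating to the stock implementation.
    print('BN input abs max:', input.abs().max().item())
    return _orig_batch_norm(input, running_mean, running_var,
                            weight, bias, training, momentum, eps)

F.batch_norm = debug_batch_norm  # nn.BatchNorm* modules route through F.batch_norm
```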
In 0.1.12, `batch_norm` is defined here. The corresponding C function is defined here.
I’ve compared the `BatchNorm` implementation against the one in version 0.3.0 (code). Between these versions there are some minor changes:
- `long` was replaced by `int64_t`
- `THTensor_(resizeAs)(gradInput, input)` was moved up a bit
You can find the current implementation of `BatchNorm` here. Basically, some conditions were added, since `BatchNorm` supports `track_running_stats` from version 0.4.0 on.
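For illustration, here is a minimal sketch of what that flag changes (shapes and values chosen arbitrarily): with `track_running_stats=False` the layer keeps no running estimates and normalizes with batch statistics even in `eval()` mode.

```python
import torch
import torch.nn as nn

# track_running_stats=True (the default) maintains running_mean /
# running_var and uses them in eval() mode; with False, the layer
# normalizes with batch statistics even in eval() mode.
bn_tracked = nn.BatchNorm2d(3)
bn_untracked = nn.BatchNorm2d(3, track_running_stats=False)

x = torch.randn(8, 3, 4, 4)
bn_tracked.eval()
bn_untracked.eval()

out_tracked = bn_tracked(x)      # uses the stored running statistics
out_untracked = bn_untracked(x)  # falls back to batch statistics
```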
Are you sure the difference is due to the `BatchNorm` layers?
Thanks for the timely reply! It’s very helpful. My reason for blaming the BN layers is that I observe unreasonably large outputs (>1000) after a BN layer when model.eval() is set. Also, models without BN layers do not suffer from the discrepancy between PyTorch versions. I will try to hack the BN implementation and see if it resolves the problem.
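In case it helps anyone with the same symptom, here is the rough check I am using (the helper name is my own; `model` stands for the loaded network): a near-zero `running_var` divides the activations by almost nothing in `eval()` mode and can easily produce outputs in the thousands.

```python
import torch.nn as nn

def inspect_bn_stats(model):
    # Print the running statistics of every BatchNorm layer; a very
    # small running_var is the usual cause of huge eval-mode outputs.
    for name, m in model.named_modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            print('%s: |running_mean| max %.4f, running_var min %.6f'
                  % (name, m.running_mean.abs().max().item(),
                     m.running_var.min().item()))
```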
Hi ptrblck,
I am wondering: is the C code you provided actually what runs as BN in the default PyTorch setting (when `torch.backends.cudnn.enabled` is True), or is it the hidden cuDNN library? Thanks!
Under certain conditions it uses cudnn. In 0.4 this is checked here. You could grep for the magic size to find it in other versions.
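A quick empirical check (a sketch with a toy `BatchNorm2d` standing in for a real model; requires a GPU) is to run the same input with the cuDNN path enabled and disabled and compare the outputs:

```python
import torch
import torch.nn as nn

# Toy stand-in; any module containing BatchNorm layers works the same way.
model = nn.BatchNorm2d(16).cuda().eval()
x = torch.randn(8, 16, 32, 32).cuda()

torch.backends.cudnn.enabled = True
out_cudnn = model(x)    # may dispatch to the cuDNN kernel
torch.backends.cudnn.enabled = False
out_native = model(x)   # forces the native THNN/THCUNN kernel
print((out_cudnn - out_native).abs().max())
```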
Best regards
Thomas