BatchNorm parameters after model reload

Hi,
(I use PyTorch 1.3.1 on a V100 GPU.)

Here is my problem (a new one): I have a model with batch normalisation. During the training and testing phase (same script), at each epoch I call model.train() before looping over the training batches to perform the optimization, and model.eval() before evaluating the current model. Also, at each epoch, I save the model state dictionary, as well as the optimizer and scheduler, as

    state = {
        'epoch': epoch + 1,
        'model_state_dict': model.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'optimizer_state_dict': optimizer.state_dict()
    }
    torch.save(state, "model.pth")

Notice that before this saving phase the model is in eval() mode, since the saving takes place right after the testing step.

I have recorded, during this training/testing phase, the statistics of the outputs of the last 3 BN layers. Typically, for the last epoch, I get during training (min, max, mean, std):

last bn2:  -46.39485549926758 36.850852966308594 -0.39066851139068604 1.3487299680709839
last bn2:  -28.653053283691406 15.431465148925781 -0.6616867780685425 1.2987738847732544
last bn2:  -47.44236373901367 19.981163024902344 -0.9964369535446167 1.3029693365097046

and during testing:

last bn2:  -58.15468215942383 40.56957244873047 -0.29340481758117676 1.1473218202590942
last bn2:  -21.610492706298828 14.811783790588379 -0.47804850339889526 1.0458149909973145
last bn2:  -70.31809997558594 41.03254318237305 -0.8271327018737793 1.351212978363037

(notice that these layers are followed by ReLU activation).
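
For reference, something like the following forward-hook sketch can produce such per-layer output statistics (the toy model and input shapes below are placeholders, not my actual network):

    import torch
    import torch.nn as nn

    def log_bn_stats(name):
        # Forward hook: print min / max / mean / std of the layer output,
        # in the same format as the "last bn2" lines above.
        def hook(module, inputs, output):
            print(name, output.min().item(), output.max().item(),
                  output.mean().item(), output.std().item())
        return hook

    # Toy model standing in for the real network (placeholder architecture).
    model = nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
    )

    # Register a hook on every BatchNorm layer.
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            module.register_forward_hook(log_bn_stats(name))

    model.eval()
    with torch.no_grad():
        model(torch.randn(4, 3, 32, 32))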

Ok, so far so good. Now, in another script, I load the model and set it to model.eval():

model.to(device)
checkpoint = torch.load(checkpoint_file)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

Then, on the same sets (both train and test) as used in the previous script, I process some samples and get

train[ 9 ]
last bn2:  -365.64263916015625 177.18020629882812 -0.2252720594406128 1.798872470855713
last bn2:  -135.1555938720703 49.426048278808594 -1.0530227422714233 1.5416172742843628
last bn2:  -89.00902557373047 36.76680374145508 -0.7778999209403992 0.9938169121742249

for a training sample and

test [ 9 ]
last bn2:  -240.713134765625 114.61311340332031 -0.19601963460445404 1.3636577129364014
last bn2:  -108.94783782958984 55.55070114135742 -1.0378086566925049 1.46784508228302
last bn2:  -100.44957733154297 29.672212600708008 -0.7665001153945923 0.9042347073554993

for a testing sample.

As you can see, the stats are completely different!

Can you give me some hints to understand where this difference comes from, e.g.:

  1. a bad saving procedure
  2. a bad loading procedure
  3. any other good reason

Thanks

Could you check the batch norm parameters and running estimates rather than activation statistics?

print(bn.weight)
print(bn.bias)
print(bn.running_mean)
print(bn.running_var)
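
For example, a small sketch that dumps these for every batch norm layer in the model (assuming `model` is your network):

    import torch.nn as nn

    def dump_bn_params(model):
        # Print the affine parameters and running estimates of every BatchNorm layer.
        for name, bn in model.named_modules():
            if isinstance(bn, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
                print(name)
                print(bn.weight)
                print(bn.bias)
                print(bn.running_mean)
                print(bn.running_var)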

Hello @ptrblck,
Here are some BN parameters recorded during the training phase (1st script, 21st epoch): (weight, bias, running_mean, running_var) for the last 3 BN layers.

last bn2:  Parameter containing:
tensor([1.2982, 1.4249, 1.4694, 1.1842, 1.6337, 0.8268, 1.9409, 1.6504, 1.3809,
        0.9634, 0.9773, 1.1856, 1.4768, 1.0256, 1.4341, 1.8847, 1.6515, 1.6981,
        1.7260, 1.4342, 1.4955, 1.1769, 0.6325, 1.3704, 1.3990, 0.9546, 1.4415,
        2.1327, 1.9115, 1.5582, 1.6701, 0.8285, 1.1619, 0.7929, 1.4751, 1.3409,
        1.9051, 1.8635, 1.3695, 1.4589, 0.8433, 1.4457, 1.2850, 1.8375, 1.0605,
        0.9001, 1.2961, 1.4705, 1.2991, 1.4169, 1.4821, 1.1656, 1.3896, 1.6747,
        1.1240, 1.7229, 1.0663, 1.6442, 1.4051, 1.8916, 1.5729, 1.2942, 1.9144,
        1.8643], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([ 0.2687, -0.1299, -0.2862, -0.1981, -0.1809,  0.0323, -0.7159, -0.1473,
        -0.2203,  0.0297, -0.9406, -0.4634, -0.3956, -0.7438,  0.1869, -0.2611,
        -0.9550, -0.7194, -0.2852, -0.1727, -0.3562, -0.3324, -0.9602, -0.0945,
        -1.1909, -1.2160, -0.2824, -1.6246, -0.9243, -0.7471, -0.3608,  0.1162,
        -0.2825, -0.5812, -0.5545, -0.6702, -1.0005, -0.4372, -0.6702, -0.6860,
        -1.2266, -0.4523, -0.5960, -0.6286, -0.4859, -0.0594, -0.4732, -0.4010,
        -0.1200, -0.2199, -0.2765, -0.3081, -0.2625, -0.3765, -0.1883, -0.2134,
        -0.3645, -0.7368, -0.5896, -0.5729, -0.2501, -0.4575, -0.5873, -1.0232],
       device='cuda:0', requires_grad=True) ,  
tensor([-2.3414, -0.0762, -4.4657,  2.2102, -0.7565, -0.3413, -0.6956,  0.5306,
        -0.7504,  0.1537,  1.2946, -1.7770, -3.9468, -0.3032, -0.4972, -5.8405,
        -2.9753, -4.8750, -5.3343, -2.1997, -2.5710, -3.0228, -2.0223, -2.9445,
        -2.8789,  0.1822,  0.0370, -1.4450, -1.1861, -4.9425, -0.1154, -0.0144,
        -2.3126,  1.4797, -0.4425, -1.3551, -3.1357, -4.2052, -0.5410, -1.5914,
         5.0500, -1.9335,  0.2651,  0.5971, -2.0959, -2.7356, -1.6806,  0.0578,
        -0.1284, -2.2727, -2.9075,  1.3003, -2.6496, -6.1008, -0.3780, -4.2106,
        -1.2578, -2.2213, -1.2178, -3.5955,  0.6376, -1.4654, -3.1033, -1.9166],
       device='cuda:0') ,  
tensor([19.7372, 34.3483, 11.4133, 19.6713, 39.8743, 16.2498, 37.5281, 20.1511,
        12.4167, 47.1940, 19.4421, 10.0358, 20.8744, 37.6729, 23.4299, 30.1625,
        62.5809, 50.9769, 41.6174, 52.3469, 37.6369, 21.1181, 19.5070, 44.7570,
        14.1873, 10.1437, 26.9317, 47.2799, 73.8734, 13.9698, 18.3111, 18.3259,
        52.9219, 14.9510, 13.5643, 40.7860, 63.6663, 19.9395, 20.9938, 37.7942,
        21.9450, 13.1941, 13.9333, 26.6707, 84.7565, 34.0959,  7.9100, 37.2196,
        14.5561, 57.5539, 29.7350, 12.3494, 30.2384, 37.6077, 13.0121, 29.4787,
         9.1818, 28.0197,  7.5530, 41.6757, 22.0652, 12.7239, 27.2689, 37.0086],
       device='cuda:0')
last bn2:  Parameter containing:
tensor([1.2132, 0.9456, 1.3659, 1.3919, 1.1662, 1.1844, 1.2517, 1.1501, 1.4989,
        1.6320, 0.9488, 1.3737, 0.8806, 1.1000, 1.2283, 0.9194, 1.1356, 1.2489,
        1.3899, 1.3062, 1.2410, 1.3548, 0.9767, 1.4125, 1.6718, 1.2193, 1.1631,
        1.6019, 1.1481, 1.1907, 0.6430, 1.2457, 1.4394, 1.1496, 1.8405, 1.8280,
        0.8458, 1.0148, 1.6106, 1.2663, 1.2076, 1.5307, 1.3893, 1.6598, 1.3635,
        0.9697, 1.5816, 1.0839, 1.6166, 1.0449, 2.1544, 1.0454, 1.2257, 2.0998,
        1.3833, 2.1638, 1.3064, 1.8076, 1.2217, 1.3721, 1.3959, 1.3091, 1.9999,
        2.5280], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([-0.4818, -0.9780, -0.0048, -0.8147, -0.4602, -0.3517, -0.8782, -1.7261,
        -0.2661, -0.1016, -0.4094, -0.6321, -0.4227, -1.3490, -1.2217, -0.5916,
        -0.8924, -0.7010, -1.3590, -1.0263, -1.1012, -0.9614, -1.1294, -0.5805,
        -2.0266, -1.1708, -1.1252, -0.3307, -1.1568, -0.7484, -0.5215, -0.9342,
        -1.3919,  0.0272, -0.8650, -0.9238, -1.7281, -0.8843, -0.4509, -0.8337,
        -0.5345, -0.3246, -0.7102, -0.8581, -1.5506, -1.1191, -0.6252, -0.9698,
        -1.4920, -0.3427, -0.7036, -0.5153, -0.4201, -0.6413, -0.0301, -1.7298,
        -0.2978, -1.0555, -0.7219, -0.3115, -2.0753, -1.3508, -0.4335,  0.2637],
       device='cuda:0', requires_grad=True) ,  
tensor([-1.2027, -4.9190, -4.0278,  1.0473, -5.7369, -6.1543,  2.6430, -0.4797,
        -3.0251, -4.5064, -0.6882, -5.5206, -0.5206,  3.0135,  6.9326, -3.6242,
        -2.3153, -2.0628, -4.2832, -3.3764,  0.8328, -2.4588, -1.2909, -2.6999,
        -1.9278,  4.9518,  1.8203, -2.1241, -4.0407, -0.6308, -4.2802, -1.9552,
        -4.1911, -2.2117, -3.6712, -1.0228, -1.1096, -1.8062, -6.5683, -1.9105,
        -2.5110, -2.2063, -1.7795, -5.8016, -7.3785, -3.7854, -8.3147, -7.6108,
        -1.2826, -5.0902, -9.8604, -2.2407, -6.0330, -4.6392, -2.4429, -9.5457,
        -3.0837, -6.6456, -4.0732, -4.1598,  9.5285,  0.9562, -5.9486, -5.6741],
       device='cuda:0') ,  
tensor([ 12.2579,  25.1358,  50.5686,  33.3328,  39.3711,  27.0042,  32.3329,
         13.2217,  21.1123,  27.3941,  12.0679,  18.8699,   3.5534,  48.7358,
         26.1858,   8.9114,  44.7922,  29.5917,  43.7519,  16.3202,  24.2054,
         22.8319,  25.2850,  10.6213,  22.3811,  14.9700,  13.0580,  22.7720,
         26.4781,  36.9254,  26.0513,  21.4955,  47.0494,  24.9660,  53.1831,
         30.1543,  20.1754,   8.6688,  23.2288,  28.3502,   6.2628,  11.2093,
         20.6440,  27.4490,  25.8902,  19.8295,  48.4989,  32.7154,  39.9370,
         46.1750,  61.8067,   7.0948,  53.6007,  49.7234,  17.6993,  61.8807,
         12.8677,  30.2134,  26.5136,  31.7786,  22.1144,   8.7360,  69.4289,
        159.5967], device='cuda:0')
last bn2:  Parameter containing:
tensor([1.0131, 1.6776, 1.0399, 1.0545, 1.2998, 0.8719, 0.9924, 0.8877, 0.8924,
        1.4898, 0.7500, 0.7569, 1.4165, 1.0043, 0.9298, 1.2033, 1.5913, 1.3198,
        1.8148, 1.3644, 1.4178, 1.3320, 1.5484, 1.2796, 1.7728, 1.2193, 0.5042,
        1.6028, 1.3242, 0.8343, 1.5554, 1.3488, 1.6786, 1.3033, 0.9270, 0.8402,
        1.7525, 1.6097, 1.0565, 0.9529, 1.4094, 1.3494, 1.3724, 1.7803, 2.2156,
        0.8547, 0.9591, 0.9370, 1.5401, 1.3557, 2.0128, 0.9239, 1.8261, 1.1380,
        1.4736, 1.2205, 0.8229, 1.3655, 0.8386, 1.2024, 0.8697, 2.3035, 1.8947,
        2.9846], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([-0.7493, -0.3273, -1.2438, -0.5735, -0.4680, -0.5814, -0.7146, -0.5950,
        -0.7714, -1.3816, -0.6494, -0.8979, -1.4037, -0.6445, -0.5519, -0.9010,
        -0.4628, -0.5682, -1.0769, -1.0795, -1.4922, -1.2488, -0.2147, -0.8856,
        -0.3575, -1.3453, -1.4390, -0.8148, -1.6596, -0.9726, -1.0171, -1.6122,
        -1.3630, -1.0178, -0.6567, -1.3944, -1.7762, -1.0128, -1.1859, -0.6930,
        -0.6512, -1.3937, -1.2731, -1.6545, -0.7730, -0.2427, -0.9498, -0.2645,
        -1.7236, -0.6182, -1.6050, -0.7177, -0.5209, -1.2028, -1.1035, -0.7665,
        -0.7494, -0.5912, -0.7557, -1.0471, -1.0026, -1.6275, -1.3899, -1.8718],
       device='cuda:0', requires_grad=True) , 
 tensor([ 2.1060, -3.3184,  1.3379, -1.8766,  0.0092,  1.7759, -1.8018, -0.0068,
        -0.8404,  3.5756, -2.7357,  0.4318,  3.1491, -1.4038, -2.7734,  1.6359,
        -4.2769,  1.6799,  2.8231,  2.4750,  3.4888,  0.4834, -2.4669, -0.9539,
        -0.7643,  0.2960, -1.2066, -2.8098, -0.7797, -0.2865, -2.7306,  1.7672,
        -0.9632, -0.0790, -2.8772, -1.1044,  1.6680,  0.3495, -1.1186,  1.5150,
         0.0268,  1.1520,  2.1410, -0.6470, -1.9260, -3.2692, -1.0425, -1.1205,
         0.6737, -1.7648, -2.6867, -0.1441, -0.2340, -0.5263, -0.2512,  0.7065,
        -0.4533, -0.8497,  0.3349, -1.3326,  0.4484, -1.1717, -2.8486, -5.8584],
       device='cuda:0') ,  
tensor([ 33.5016,  96.4214,  23.4838,  21.5238,  83.2108,  31.5242,  13.6605,
         23.1879,  19.5391,  47.3836,  37.6485,  15.5467,   7.9038,  79.0533,
         16.8746,  28.5609,  93.4751,  57.6028,  35.8309,  60.9123,  33.2646,
         26.7693,  73.8285,  37.5991, 129.0248,  29.5361,  14.8129,  69.9999,
         23.2045,  25.6856,  27.3631,  24.2894,  21.1960,  38.7368,  19.7641,
          6.2707,  64.6141,  50.1681,   6.9468,  26.0967,  24.1769,  12.3743,
         44.1202,  38.8951, 150.1269,  53.2054,   8.7804,  29.0404,  82.6736,
         61.4551,  89.7517,  15.1844, 137.6564,  26.1992,  28.2784,  51.4460,
          7.6213,  28.9920,  13.3594,  16.5385,  30.4891,  64.6970,  93.4503,
        242.5249], device='cuda:0')

and now, from the 2nd script, for a sample of the training set:

last bn2:  Parameter containing:
tensor([1.3130, 1.4322, 1.4759, 1.1870, 1.6422, 0.8240, 1.9335, 1.6578, 1.3818,
        0.9475, 0.9631, 1.1852, 1.4794, 1.0388, 1.4503, 1.8852, 1.6586, 1.6897,
        1.7296, 1.4360, 1.4960, 1.1951, 0.6587, 1.3640, 1.4112, 0.9508, 1.4449,
        2.1314, 1.9235, 1.5671, 1.6716, 0.8424, 1.1672, 0.7789, 1.4773, 1.3347,
        1.9011, 1.8599, 1.3713, 1.4554, 0.8364, 1.4475, 1.2822, 1.8409, 1.0716,
        0.9194, 1.3011, 1.4609, 1.2963, 1.4187, 1.4787, 1.1726, 1.3884, 1.6725,
        1.1408, 1.7288, 1.0696, 1.6504, 1.4019, 1.8920, 1.5797, 1.3001, 1.9136,
        1.8508], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([ 0.2610, -0.1192, -0.2794, -0.1839, -0.1911,  0.0302, -0.7239, -0.1372,
        -0.2137,  0.0203, -0.9448, -0.4677, -0.3949, -0.7334,  0.1920, -0.2577,
        -0.9503, -0.7198, -0.2795, -0.1703, -0.3546, -0.3431, -0.9536, -0.1196,
        -1.1900, -1.2218, -0.2807, -1.6274, -0.9271, -0.7483, -0.3583,  0.1173,
        -0.2678, -0.5785, -0.5577, -0.6650, -1.0104, -0.4461, -0.6745, -0.6822,
        -1.2277, -0.4532, -0.6101, -0.6298, -0.4814, -0.0632, -0.4760, -0.4085,
        -0.1274, -0.2159, -0.2797, -0.3004, -0.2740, -0.3768, -0.1742, -0.2056,
        -0.3643, -0.7323, -0.6014, -0.5840, -0.2508, -0.4477, -0.5840, -1.0294],
       device='cuda:0', requires_grad=True) ,  
tensor([-2.2487, -0.0267, -4.4345,  2.2349, -0.5939, -0.3146, -0.7059,  0.5346,
        -0.7631,  0.2327,  1.2890, -1.7704, -3.8160, -0.3621, -0.3365, -5.7577,
        -2.8567, -4.7889, -5.2449, -2.0464, -2.3964, -2.8191, -1.9809, -2.8985,
        -2.8649,  0.1490,  0.1599, -1.3721, -1.1775, -4.7976, -0.1694,  0.0372,
        -2.2149,  1.4353, -0.5029, -1.3503, -3.0341, -4.2152, -0.5042, -1.5323,
         4.9827, -1.8563,  0.3215,  0.6166, -2.0846, -2.6593, -1.6455,  0.0569,
        -0.1572, -2.2454, -2.8570,  1.2430, -2.4890, -5.8967, -0.2882, -4.0961,
        -1.1955, -2.1761, -1.1692, -3.5584,  0.7872, -1.3242, -3.1595, -1.8767],
       device='cuda:0') ,  
tensor([18.6767, 34.1358, 11.4390, 19.5910, 39.6207, 15.6089, 38.5675, 19.6088,
        12.0001, 48.5460, 18.9193, 10.2684, 20.7013, 38.5611, 22.5086, 30.3580,
        60.5513, 47.2820, 37.0344, 55.6837, 31.7863, 18.8953, 19.9124, 45.9320,
        13.5144,  9.5781, 26.2481, 48.4623, 73.2046, 14.0140, 17.0365, 18.2471,
        52.2522, 15.1296, 13.7018, 40.5819, 63.3091, 19.0931, 22.0876, 39.8415,
        21.2570, 13.0269, 13.2726, 27.5679, 87.0976, 33.8244,  7.8651, 37.4086,
        14.9787, 58.3856, 30.9363, 12.5442, 31.0509, 37.3950, 12.2668, 30.5216,
         9.2875, 27.5106,  7.7447, 42.8648, 20.2751, 13.2040, 27.8389, 36.9584],
       device='cuda:0')
last bn2:  Parameter containing:
tensor([1.2176, 0.9395, 1.3707, 1.4056, 1.1606, 1.1889, 1.2556, 1.1792, 1.4981,
        1.6257, 0.9540, 1.3723, 0.8800, 1.1068, 1.2486, 0.9237, 1.1463, 1.2368,
        1.3898, 1.3123, 1.2511, 1.3555, 0.9953, 1.4025, 1.6692, 1.2259, 1.1611,
        1.6055, 1.1524, 1.1827, 0.6460, 1.2517, 1.4539, 1.1413, 1.8356, 1.8307,
        0.8544, 1.0127, 1.6023, 1.2618, 1.2136, 1.5359, 1.3891, 1.6707, 1.3745,
        0.9744, 1.5826, 1.0776, 1.6267, 1.0411, 2.1596, 1.0476, 1.2341, 2.1007,
        1.3851, 2.1707, 1.3042, 1.8146, 1.2235, 1.3841, 1.4139, 1.3153, 2.0093,
        2.5413], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([-4.7871e-01, -9.6927e-01, -1.7224e-03, -8.1840e-01, -4.6485e-01,
        -3.5434e-01, -8.7399e-01, -1.7173e+00, -2.6639e-01, -1.1620e-01,
        -4.0594e-01, -6.3611e-01, -4.2247e-01, -1.3457e+00, -1.2041e+00,
        -5.9009e-01, -8.8615e-01, -7.0870e-01, -1.3521e+00, -1.0227e+00,
        -1.0985e+00, -9.6021e-01, -1.1122e+00, -5.9000e-01, -2.0304e+00,
        -1.1716e+00, -1.1234e+00, -3.3070e-01, -1.1518e+00, -7.5751e-01,
        -5.1710e-01, -9.3377e-01, -1.3751e+00,  1.6681e-02, -8.6775e-01,
        -9.1537e-01, -1.7293e+00, -8.8723e-01, -4.6228e-01, -8.3449e-01,
        -5.3614e-01, -3.2286e-01, -7.1319e-01, -8.5197e-01, -1.5393e+00,
        -1.1145e+00, -6.2064e-01, -9.7529e-01, -1.4950e+00, -3.4445e-01,
        -6.9570e-01, -5.1130e-01, -4.3359e-01, -6.4706e-01, -2.8110e-02,
        -1.7330e+00, -3.0567e-01, -1.0566e+00, -7.2447e-01, -2.9905e-01,
        -2.0701e+00, -1.3506e+00, -4.2667e-01,  2.5480e-01], device='cuda:0',
       requires_grad=True) ,  
tensor([-1.1898, -4.9351, -4.0057,  1.0086, -5.6319, -5.9821,  2.6194, -0.3552,
        -3.0575, -4.3138, -0.6626, -5.4045, -0.5088,  2.9480,  6.8847, -3.5900,
        -2.3166, -1.9901, -4.2139, -3.2843,  0.8747, -2.4156, -1.2901, -2.6682,
        -1.9121,  4.9088,  1.8143, -2.0511, -3.9360, -0.6246, -4.2520, -1.9796,
        -4.1742, -2.1529, -3.6492, -0.9864, -1.0758, -1.8126, -6.4839, -1.7888,
        -2.4873, -2.1558, -1.7248, -5.7122, -7.2257, -3.7576, -8.2205, -7.5753,
        -1.2968, -5.0736, -9.8686, -2.1209, -5.9372, -4.5909, -2.3356, -9.4508,
        -3.1045, -6.6114, -3.9840, -4.0213,  9.4352,  0.9532, -5.9125, -5.4879],
       device='cuda:0') ,  
tensor([ 12.1167,  22.3456,  52.5596,  32.6017,  39.2995,  26.2953,  32.5620,
         11.8171,  20.9380,  26.9390,  12.6023,  18.8558,   3.4587,  48.1535,
         26.0128,   8.8467,  44.3347,  29.0326,  43.6828,  15.9957,  22.5822,
         23.3266,  25.0776,  10.8235,  22.8202,  14.4317,  12.7207,  22.4064,
         24.1998,  35.3748,  25.5444,  21.8595,  47.5199,  24.5493,  49.7043,
         30.0234,  20.0453,   8.8537,  25.1898,  27.8654,   6.3191,  10.7331,
         20.3446,  29.3200,  25.7950,  19.1852,  46.6275,  32.5919,  40.6134,
         46.5111,  63.7662,   6.8436,  52.5796,  50.2659,  15.8908,  65.2752,
         13.9098,  30.3731,  26.3604,  31.9325,  23.2201,   9.0683,  68.3035,
        160.0729], device='cuda:0')
last bn2:  Parameter containing:
tensor([1.0219, 1.6891, 1.0429, 1.0544, 1.3186, 0.8925, 0.9936, 0.8938, 0.9181,
        1.5095, 0.7538, 0.7682, 1.4441, 1.0064, 0.9363, 1.2163, 1.5969, 1.3383,
        1.8469, 1.3749, 1.4462, 1.3553, 1.5929, 1.2899, 1.7856, 1.2306, 0.5044,
        1.6199, 1.3629, 0.8456, 1.5677, 1.3696, 1.7176, 1.3275, 0.9310, 0.8524,
        1.7653, 1.6210, 1.0722, 0.9589, 1.4297, 1.3783, 1.3891, 1.8061, 2.2377,
        0.8583, 0.9615, 0.9531, 1.5522, 1.3637, 2.0357, 0.9248, 1.8534, 1.1628,
        1.5027, 1.2333, 0.8328, 1.3651, 0.8495, 1.2171, 0.8875, 2.3208, 1.9196,
        3.0162], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([-0.7427, -0.3284, -1.2478, -0.5741, -0.4600, -0.5704, -0.7137, -0.5929,
        -0.7640, -1.3865, -0.6477, -0.8946, -1.4014, -0.6391, -0.5462, -0.8997,
        -0.4648, -0.5642, -1.0706, -1.0728, -1.4918, -1.2360, -0.2007, -0.8954,
        -0.3590, -1.3371, -1.4395, -0.8172, -1.6494, -0.9682, -1.0186, -1.6056,
        -1.3530, -1.0076, -0.6530, -1.3895, -1.7711, -1.0099, -1.1893, -0.6902,
        -0.6474, -1.3866, -1.2755, -1.6510, -0.7665, -0.2408, -0.9487, -0.2597,
        -1.7183, -0.6129, -1.6070, -0.7182, -0.5159, -1.2014, -1.0848, -0.7622,
        -0.7515, -0.5942, -0.7518, -1.0430, -0.9992, -1.6285, -1.3916, -1.8781],
       device='cuda:0', requires_grad=True) ,  
tensor([ 2.1121, -3.2006,  1.2960, -1.8621,  0.1384,  1.8195, -1.7681,  0.0503,
        -0.8597,  3.6108, -2.7463,  0.3970,  3.2550, -1.3785, -2.7581,  1.6680,
        -4.1868,  1.6791,  2.6552,  2.5798,  3.3567,  0.3934, -2.3929, -0.8428,
        -0.7063,  0.3074, -1.1835, -2.7162, -0.7658, -0.3113, -2.6071,  1.8666,
        -0.9576, -0.1988, -2.8538, -1.1063,  1.4784,  0.2578, -1.0597,  1.5955,
         0.0728,  1.1215,  2.2330, -0.6613, -1.7490, -3.2380, -1.0174, -1.0887,
         0.6344, -1.6754, -2.7742, -0.1090, -0.0352, -0.5426, -0.4153,  0.7688,
        -0.4020, -0.8723,  0.2263, -1.2617,  0.4648, -1.2322, -2.9105, -5.7573],
       device='cuda:0') ,  
tensor([ 32.5671,  86.5258,  23.2426,  21.0717,  77.4587,  31.4727,  12.7437,
         22.1881,  19.2188,  48.6049,  37.4767,  15.4108,   8.7037,  75.8836,
         16.1836,  28.2189,  84.7191,  56.2729,  32.2407,  61.4940,  30.5091,
         24.6064,  77.9436,  34.8665, 123.5879,  28.0330,  13.2515,  65.2489,
         23.6730,  24.9784,  24.2900,  24.3675,  22.1897,  37.7058,  19.1534,
          5.8579,  59.0559,  48.7444,   6.7933,  25.9317,  23.4334,  12.7552,
         44.6136,  38.6002, 136.9605,  50.7987,   8.6579,  28.8959,  81.1509,
         58.0896,  84.2928,  14.8322, 131.0033,  26.4126,  27.4327,  49.1497,
          7.1132,  27.4740,  12.2148,  16.6042,  29.0359,  67.4034,  85.2352,
        216.4696], device='cuda:0')

and for a sample of the test set:

test [ 9 ]
last bn2:  Parameter containing:
tensor([1.3130, 1.4322, 1.4759, 1.1870, 1.6422, 0.8240, 1.9335, 1.6578, 1.3818,
        0.9475, 0.9631, 1.1852, 1.4794, 1.0388, 1.4503, 1.8852, 1.6586, 1.6897,
        1.7296, 1.4360, 1.4960, 1.1951, 0.6587, 1.3640, 1.4112, 0.9508, 1.4449,
        2.1314, 1.9235, 1.5671, 1.6716, 0.8424, 1.1672, 0.7789, 1.4773, 1.3347,
        1.9011, 1.8599, 1.3713, 1.4554, 0.8364, 1.4475, 1.2822, 1.8409, 1.0716,
        0.9194, 1.3011, 1.4609, 1.2963, 1.4187, 1.4787, 1.1726, 1.3884, 1.6725,
        1.1408, 1.7288, 1.0696, 1.6504, 1.4019, 1.8920, 1.5797, 1.3001, 1.9136,
        1.8508], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([ 0.2610, -0.1192, -0.2794, -0.1839, -0.1911,  0.0302, -0.7239, -0.1372,
        -0.2137,  0.0203, -0.9448, -0.4677, -0.3949, -0.7334,  0.1920, -0.2577,
        -0.9503, -0.7198, -0.2795, -0.1703, -0.3546, -0.3431, -0.9536, -0.1196,
        -1.1900, -1.2218, -0.2807, -1.6274, -0.9271, -0.7483, -0.3583,  0.1173,
        -0.2678, -0.5785, -0.5577, -0.6650, -1.0104, -0.4461, -0.6745, -0.6822,
        -1.2277, -0.4532, -0.6101, -0.6298, -0.4814, -0.0632, -0.4760, -0.4085,
        -0.1274, -0.2159, -0.2797, -0.3004, -0.2740, -0.3768, -0.1742, -0.2056,
        -0.3643, -0.7323, -0.6014, -0.5840, -0.2508, -0.4477, -0.5840, -1.0294],
       device='cuda:0', requires_grad=True) ,  
tensor([-2.2487, -0.0267, -4.4345,  2.2349, -0.5939, -0.3146, -0.7059,  0.5346,
        -0.7631,  0.2327,  1.2890, -1.7704, -3.8160, -0.3621, -0.3365, -5.7577,
        -2.8567, -4.7889, -5.2449, -2.0464, -2.3964, -2.8191, -1.9809, -2.8985,
        -2.8649,  0.1490,  0.1599, -1.3721, -1.1775, -4.7976, -0.1694,  0.0372,
        -2.2149,  1.4353, -0.5029, -1.3503, -3.0341, -4.2152, -0.5042, -1.5323,
         4.9827, -1.8563,  0.3215,  0.6166, -2.0846, -2.6593, -1.6455,  0.0569,
        -0.1572, -2.2454, -2.8570,  1.2430, -2.4890, -5.8967, -0.2882, -4.0961,
        -1.1955, -2.1761, -1.1692, -3.5584,  0.7872, -1.3242, -3.1595, -1.8767],
       device='cuda:0') ,  
tensor([18.6767, 34.1358, 11.4390, 19.5910, 39.6207, 15.6089, 38.5675, 19.6088,
        12.0001, 48.5460, 18.9193, 10.2684, 20.7013, 38.5611, 22.5086, 30.3580,
        60.5513, 47.2820, 37.0344, 55.6837, 31.7863, 18.8953, 19.9124, 45.9320,
        13.5144,  9.5781, 26.2481, 48.4623, 73.2046, 14.0140, 17.0365, 18.2471,
        52.2522, 15.1296, 13.7018, 40.5819, 63.3091, 19.0931, 22.0876, 39.8415,
        21.2570, 13.0269, 13.2726, 27.5679, 87.0976, 33.8244,  7.8651, 37.4086,
        14.9787, 58.3856, 30.9363, 12.5442, 31.0509, 37.3950, 12.2668, 30.5216,
         9.2875, 27.5106,  7.7447, 42.8648, 20.2751, 13.2040, 27.8389, 36.9584],
       device='cuda:0')
last bn2:  Parameter containing:
tensor([1.2176, 0.9395, 1.3707, 1.4056, 1.1606, 1.1889, 1.2556, 1.1792, 1.4981,
        1.6257, 0.9540, 1.3723, 0.8800, 1.1068, 1.2486, 0.9237, 1.1463, 1.2368,
        1.3898, 1.3123, 1.2511, 1.3555, 0.9953, 1.4025, 1.6692, 1.2259, 1.1611,
        1.6055, 1.1524, 1.1827, 0.6460, 1.2517, 1.4539, 1.1413, 1.8356, 1.8307,
        0.8544, 1.0127, 1.6023, 1.2618, 1.2136, 1.5359, 1.3891, 1.6707, 1.3745,
        0.9744, 1.5826, 1.0776, 1.6267, 1.0411, 2.1596, 1.0476, 1.2341, 2.1007,
        1.3851, 2.1707, 1.3042, 1.8146, 1.2235, 1.3841, 1.4139, 1.3153, 2.0093,
        2.5413], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([-4.7871e-01, -9.6927e-01, -1.7224e-03, -8.1840e-01, -4.6485e-01,
        -3.5434e-01, -8.7399e-01, -1.7173e+00, -2.6639e-01, -1.1620e-01,
        -4.0594e-01, -6.3611e-01, -4.2247e-01, -1.3457e+00, -1.2041e+00,
        -5.9009e-01, -8.8615e-01, -7.0870e-01, -1.3521e+00, -1.0227e+00,
        -1.0985e+00, -9.6021e-01, -1.1122e+00, -5.9000e-01, -2.0304e+00,
        -1.1716e+00, -1.1234e+00, -3.3070e-01, -1.1518e+00, -7.5751e-01,
        -5.1710e-01, -9.3377e-01, -1.3751e+00,  1.6681e-02, -8.6775e-01,
        -9.1537e-01, -1.7293e+00, -8.8723e-01, -4.6228e-01, -8.3449e-01,
        -5.3614e-01, -3.2286e-01, -7.1319e-01, -8.5197e-01, -1.5393e+00,
        -1.1145e+00, -6.2064e-01, -9.7529e-01, -1.4950e+00, -3.4445e-01,
        -6.9570e-01, -5.1130e-01, -4.3359e-01, -6.4706e-01, -2.8110e-02,
        -1.7330e+00, -3.0567e-01, -1.0566e+00, -7.2447e-01, -2.9905e-01,
        -2.0701e+00, -1.3506e+00, -4.2667e-01,  2.5480e-01], device='cuda:0',
       requires_grad=True) ,  
tensor([-1.1898, -4.9351, -4.0057,  1.0086, -5.6319, -5.9821,  2.6194, -0.3552,
        -3.0575, -4.3138, -0.6626, -5.4045, -0.5088,  2.9480,  6.8847, -3.5900,
        -2.3166, -1.9901, -4.2139, -3.2843,  0.8747, -2.4156, -1.2901, -2.6682,
        -1.9121,  4.9088,  1.8143, -2.0511, -3.9360, -0.6246, -4.2520, -1.9796,
        -4.1742, -2.1529, -3.6492, -0.9864, -1.0758, -1.8126, -6.4839, -1.7888,
        -2.4873, -2.1558, -1.7248, -5.7122, -7.2257, -3.7576, -8.2205, -7.5753,
        -1.2968, -5.0736, -9.8686, -2.1209, -5.9372, -4.5909, -2.3356, -9.4508,
        -3.1045, -6.6114, -3.9840, -4.0213,  9.4352,  0.9532, -5.9125, -5.4879],
       device='cuda:0') ,  
tensor([ 12.1167,  22.3456,  52.5596,  32.6017,  39.2995,  26.2953,  32.5620,
         11.8171,  20.9380,  26.9390,  12.6023,  18.8558,   3.4587,  48.1535,
         26.0128,   8.8467,  44.3347,  29.0326,  43.6828,  15.9957,  22.5822,
         23.3266,  25.0776,  10.8235,  22.8202,  14.4317,  12.7207,  22.4064,
         24.1998,  35.3748,  25.5444,  21.8595,  47.5199,  24.5493,  49.7043,
         30.0234,  20.0453,   8.8537,  25.1898,  27.8654,   6.3191,  10.7331,
         20.3446,  29.3200,  25.7950,  19.1852,  46.6275,  32.5919,  40.6134,
         46.5111,  63.7662,   6.8436,  52.5796,  50.2659,  15.8908,  65.2752,
         13.9098,  30.3731,  26.3604,  31.9325,  23.2201,   9.0683,  68.3035,
        160.0729], device='cuda:0')
last bn2:  Parameter containing:
tensor([1.0219, 1.6891, 1.0429, 1.0544, 1.3186, 0.8925, 0.9936, 0.8938, 0.9181,
        1.5095, 0.7538, 0.7682, 1.4441, 1.0064, 0.9363, 1.2163, 1.5969, 1.3383,
        1.8469, 1.3749, 1.4462, 1.3553, 1.5929, 1.2899, 1.7856, 1.2306, 0.5044,
        1.6199, 1.3629, 0.8456, 1.5677, 1.3696, 1.7176, 1.3275, 0.9310, 0.8524,
        1.7653, 1.6210, 1.0722, 0.9589, 1.4297, 1.3783, 1.3891, 1.8061, 2.2377,
        0.8583, 0.9615, 0.9531, 1.5522, 1.3637, 2.0357, 0.9248, 1.8534, 1.1628,
        1.5027, 1.2333, 0.8328, 1.3651, 0.8495, 1.2171, 0.8875, 2.3208, 1.9196,
        3.0162], device='cuda:0', requires_grad=True) ,  
Parameter containing:
tensor([-0.7427, -0.3284, -1.2478, -0.5741, -0.4600, -0.5704, -0.7137, -0.5929,
        -0.7640, -1.3865, -0.6477, -0.8946, -1.4014, -0.6391, -0.5462, -0.8997,
        -0.4648, -0.5642, -1.0706, -1.0728, -1.4918, -1.2360, -0.2007, -0.8954,
        -0.3590, -1.3371, -1.4395, -0.8172, -1.6494, -0.9682, -1.0186, -1.6056,
        -1.3530, -1.0076, -0.6530, -1.3895, -1.7711, -1.0099, -1.1893, -0.6902,
        -0.6474, -1.3866, -1.2755, -1.6510, -0.7665, -0.2408, -0.9487, -0.2597,
        -1.7183, -0.6129, -1.6070, -0.7182, -0.5159, -1.2014, -1.0848, -0.7622,
        -0.7515, -0.5942, -0.7518, -1.0430, -0.9992, -1.6285, -1.3916, -1.8781],
       device='cuda:0', requires_grad=True) ,  
tensor([ 2.1121, -3.2006,  1.2960, -1.8621,  0.1384,  1.8195, -1.7681,  0.0503,
        -0.8597,  3.6108, -2.7463,  0.3970,  3.2550, -1.3785, -2.7581,  1.6680,
        -4.1868,  1.6791,  2.6552,  2.5798,  3.3567,  0.3934, -2.3929, -0.8428,
        -0.7063,  0.3074, -1.1835, -2.7162, -0.7658, -0.3113, -2.6071,  1.8666,
        -0.9576, -0.1988, -2.8538, -1.1063,  1.4784,  0.2578, -1.0597,  1.5955,
         0.0728,  1.1215,  2.2330, -0.6613, -1.7490, -3.2380, -1.0174, -1.0887,
         0.6344, -1.6754, -2.7742, -0.1090, -0.0352, -0.5426, -0.4153,  0.7688,
        -0.4020, -0.8723,  0.2263, -1.2617,  0.4648, -1.2322, -2.9105, -5.7573],
       device='cuda:0') ,  
tensor([ 32.5671,  86.5258,  23.2426,  21.0717,  77.4587,  31.4727,  12.7437,
         22.1881,  19.2188,  48.6049,  37.4767,  15.4108,   8.7037,  75.8836,
         16.1836,  28.2189,  84.7191,  56.2729,  32.2407,  61.4940,  30.5091,
         24.6064,  77.9436,  34.8665, 123.5879,  28.0330,  13.2515,  65.2489,
         23.6730,  24.9784,  24.2900,  24.3675,  22.1897,  37.7058,  19.1534,
          5.8579,  59.0559,  48.7444,   6.7933,  25.9317,  23.4334,  12.7552,
         44.6136,  38.6002, 136.9605,  50.7987,   8.6579,  28.8959,  81.1509,
         58.0896,  84.2928,  14.8322, 131.0033,  26.4126,  27.4327,  49.1497,
          7.1132,  27.4740,  12.2148,  16.6042,  29.0359,  67.4034,  85.2352,
        216.4696], device='cuda:0')

I’m not sure what the difference between the first two outputs is, but it looks like the second and the last (testing) yield the same results, so I would exclude the batch norm layer as the root cause.
Moreover, I would suspect the data pipeline to be responsible for the different results.

The first script trains the model (optimization on the training set, evaluation on the test set) and saves it. The first dump above comes from this script.

The second script loads the model, switches it to model.eval(), and passes both the training set and the test set through it. The two dumps (train [ 9 ] and test [ 9 ]) come from this script.

In fact, if I compute the relative difference between the parameters (weight, bias, running_mean, running_var) of the last 3 BN layers obtained during the training phase (1st script) and in the debugging script (2nd script), I generally get differences of a few per mille up to about 1%, except for one running_mean component of the 3rd BN layer, where I get (in absolute value) tensor(-0.0068) versus tensor(0.0503) and tensor(0.0092) versus tensor(0.1384).
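
A sketch of the kind of comparison I mean, assuming `model` is the network still in memory in the training script and "model.pth" is the checkpoint saved above:

    import torch

    # Compare the in-memory model (end of training) with the reloaded checkpoint,
    # key by key, using a max relative difference per tensor.
    saved = torch.load("model.pth", map_location="cpu")['model_state_dict']
    current = model.state_dict()

    for key in current:
        a = current[key].detach().float().cpu()
        b = saved[key].float().cpu()
        rel = (a - b).abs().max() / (a.abs().max() + 1e-8)
        print(key, 'max relative difference:', rel.item())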

I do not know whether these differences can explain what I see in the BN output statistics and in the losses. Anyhow, I will check the data pipeline as you suggest.

Thanks

Dear @ptrblck,

I have performed the following experiment:

  1. I have switched off both the random shuffling in the data loaders and the data augmentation (see the loader sketch right after this list).
  2. I have dumped the first 5 input images of the first 10 batches, for both the train set and the test set, during the training session (1st script) and during the debugging session (2nd script). I have fully checked that the images fed to the network were identical epoch after epoch during training, notably at the last epoch where I save the model, and that the same images are loaded during the debugging session, which also loads the model state dictionary.
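
For reference, the deterministic loader for this check looks roughly like the following (the dataset, transform and batch size are placeholders, not my actual pipeline):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # No augmentation, no shuffling: the batches are identical from run to run.
    plain_transform = transforms.ToTensor()
    train_set = datasets.CIFAR10("data", train=True, download=True,
                                 transform=plain_transform)
    train_loader = DataLoader(train_set, batch_size=128, shuffle=False,
                              num_workers=0)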

So, here are the outputs of the model:

  1. top panels (training session, last epoch): outputs for the first 5 images of train batch 0 (left), train batch 9 (middle), and test batch 0 (right)
  2. bottom panels: outputs for the same inputs as in the top panels, but during the debugging session

Clearly something strange happens with the model parameters during the debugging session. Any idea?

I have just cross-checked that if I dump the weight/bias statistics (min/max/mean/std) of all layers before saving (1st script, training session), then the same weight/bias statistics are found for all layers after loading in the debugging session (2nd script).

So, if everything is ok, what can lead to different outputs when I feed the same inputs to a model that appears to have the same parameters loaded (at least the same stats, as described above)? Note that in the debugging session I switch to model.eval().
Thanks

model.eval() will use the running stats, which might be wrong or skewed depending on your dataset.
If you are concerned about whether the batch norm layers give valid results, could you please compare outputs using the same modes?
E.g. make sure that a fixed input creates the same output in model.eval().
Note that you will not get the same outputs for the fixed data in model.train(), as the running stats will be updated.
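
Something along these lines (the input shape, device, and file name are placeholders; `model` and `device` are assumed to be defined in your scripts):

    import torch

    # In the training script, just before saving the checkpoint:
    model.eval()
    fixed_input = torch.randn(1, 3, 32, 32, device=device)  # placeholder shape
    with torch.no_grad():
        reference = model(fixed_input)
    torch.save({'input': fixed_input.cpu(), 'output': reference.cpu()},
               "bn_check.pth")

    # In the debugging script, after load_state_dict:
    check = torch.load("bn_check.pth")
    model.eval()
    with torch.no_grad():
        reloaded = model(check['input'].to(device))
    print(torch.allclose(reloaded.cpu(), check['output'], atol=1e-5))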

Thanks @ptrblck,

model.eval() will use the running stats, which might be wrong or skewed depending on your dataset.

Do you mean that 100k input images with data augmentation (flips/rotations) are not enough for a resnet20 or a deeper resnetXXX?

If you are concerned about whether the batch norm layers give valid results, could you please compare outputs using the same modes?
E.g. make sure that a fixed input creates the same output in model.eval().
Note that you will not get the same outputs for the fixed data in model.train(), as the running stats will be updated.

In fact, notice that the last column of the 3x2 series of plots above shows 1) the outputs with model.eval() from the 1st script and 2) the outputs with model.eval() from the second script, which loads the model parameters.
Aren’t they comparable?

Thanks

It might still produce skewed running estimates, e.g. if you are dealing with 100 different distributions, each contributing 1000 samples.
E.g. if you were using 100k ImageNet samples mixed with 100k CT scans, I would assume that the running stats end up somewhere in between, which would hurt validation and test performance.
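
The running estimates are exponential moving averages of the per-batch statistics (momentum defaults to 0.1), so they track whatever mixture of distributions is seen during training. A small illustration:

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(1, momentum=0.1)
    bn.train()

    # One batch drawn from a shifted distribution (mean ~5, std ~3).
    x = torch.randn(16, 1, 8, 8) * 3.0 + 5.0
    bn(x)

    # running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    expected = 0.9 * 0.0 + 0.1 * x.mean()
    print(bn.running_mean.item(), expected.item())  # both ~0.5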

Maybe you are right, but my 100k samples all originate from the same larger set of 600k images.