Hi,
I am currently training a DNN and observe exploding gradients when training on CPU. When I train the exact same network with identical settings on GPU, the problem does not occur.
My network architecture consists of a backbone (ResNet-50 in my case), a single Conv1D layer that reduces the number of channels of the backbone feature maps, and finally a transformer.
The exploding gradients occur exclusively in the backbone parameters or in the single Conv1D layer directly after the backbone; the transformer gradients are fine. In most cases only one or two parameters are affected at a time. As stated above, training on GPU works without exploding gradients.
Sample output of the per-parameter gradients during training on CPU:
Layer input_proj.weight: tensor(0.2441)
Layer input_proj.bias: tensor(0.0393)
Layer backbone.0.body.layer2.0.conv1.weight: tensor(nan)
Layer backbone.0.body.layer2.0.conv2.weight: tensor(0.1618)
Layer backbone.0.body.layer2.0.conv3.weight: tensor(0.2671)
Layer backbone.0.body.layer2.0.downsample.0.weight: tensor(0.2590)
Layer backbone.0.body.layer2.1.conv1.weight: tensor(0.2361)
Layer backbone.0.body.layer2.1.conv2.weight: tensor(0.2590)
Layer backbone.0.body.layer2.1.conv3.weight: tensor(0.2495)
Layer backbone.0.body.layer2.2.conv1.weight: tensor(0.1893)
Layer backbone.0.body.layer2.2.conv2.weight: tensor(0.2557)
Layer backbone.0.body.layer2.2.conv3.weight: tensor(0.2158)
Layer backbone.0.body.layer2.3.conv1.weight: tensor(0.1955)
Layer backbone.0.body.layer2.3.conv2.weight: tensor(0.1515)
Layer backbone.0.body.layer2.3.conv3.weight: tensor(0.2030)
Layer backbone.0.body.layer3.0.conv1.weight: tensor(0.3426)
Layer backbone.0.body.layer3.0.conv2.weight: tensor(0.1487)
Layer backbone.0.body.layer3.0.conv3.weight: tensor(0.3212)
Layer backbone.0.body.layer3.0.downsample.0.weight: tensor(nan)
Layer backbone.0.body.layer3.1.conv1.weight: tensor(0.2182)
Layer backbone.0.body.layer3.1.conv2.weight: tensor(0.1685)
Layer backbone.0.body.layer3.1.conv3.weight: tensor(0.2598)
Layer backbone.0.body.layer3.2.conv1.weight: tensor(0.2066)
Layer backbone.0.body.layer3.2.conv2.weight: tensor(0.1889)
Layer backbone.0.body.layer3.2.conv3.weight: tensor(0.1983)
Layer backbone.0.body.layer3.3.conv1.weight: tensor(0.1970)
Layer backbone.0.body.layer3.3.conv2.weight: tensor(0.1632)
Layer backbone.0.body.layer3.3.conv3.weight: tensor(0.2933)
Layer backbone.0.body.layer3.4.conv1.weight: tensor(0.1881)
Layer backbone.0.body.layer3.4.conv2.weight: tensor(0.1286)
Layer backbone.0.body.layer3.4.conv3.weight: tensor(0.1876)
Layer backbone.0.body.layer3.5.conv1.weight: tensor(0.2929)
Layer backbone.0.body.layer3.5.conv2.weight: tensor(0.2074)
Layer backbone.0.body.layer3.5.conv3.weight: tensor(0.2674)
Layer backbone.0.body.layer4.0.conv1.weight: tensor(0.2095)
Layer backbone.0.body.layer4.0.conv2.weight: tensor(0.3962)
Layer backbone.0.body.layer4.0.conv3.weight: tensor(0.3441)
Layer backbone.0.body.layer4.0.downsample.0.weight: tensor(3.8545e+18)
Layer backbone.0.body.layer4.1.conv1.weight: tensor(0.3483)
Layer backbone.0.body.layer4.1.conv2.weight: tensor(0.0979)
Layer backbone.0.body.layer4.1.conv3.weight: tensor(0.2273)
Layer backbone.0.body.layer4.2.conv1.weight: tensor(0.3254)
Layer backbone.0.body.layer4.2.conv2.weight: tensor(0.0833)
Layer backbone.0.body.layer4.2.conv3.weight: tensor(0.2798)
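For context, output like the above can be produced with a simple loop over named_parameters after backward(). The toy model below is a hypothetical stand-in for the real network, not my actual code; torch.autograd.set_detect_anomaly can additionally pinpoint the first op that produces a NaN in the backward pass:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the backbone + Conv1D + transformer stack
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))

# Make autograd raise an error at the op that first yields NaN/Inf gradients
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, 8)
loss = model(x).sum()
loss.backward()

# Log per-parameter gradient norms and flag any non-finite ones
for name, p in model.named_parameters():
    grad_norm = p.grad.norm()
    if not torch.isfinite(grad_norm):
        print(f"Layer {name}: NON-FINITE gradient!")
    else:
        print(f"Layer {name}: {grad_norm}")
```

With anomaly detection enabled, the backward pass aborts with a traceback to the offending forward op instead of silently propagating the NaN, which should help localize where the CPU run diverges.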
PyTorch version: 2.2.2
NumPy version: 1.26.4 (I mention this since I read that NumPy might be a root cause of this kind of issue)
Any idea why this might be the case?
Thanks in advance