If I change the weight initialization to something like
import scipy.stats as stats
import torch
import torch.nn as nn

for m in self.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        stddev = m.stddev if hasattr(m, 'stddev') else 0.1
        X = stats.truncnorm(-2, 2, scale=stddev)
        # values = torch.as_tensor(X.rvs(m.weight.numel()), dtype=m.weight.dtype)
        # values = values.view(m.weight.size())
        foo = X.rvs(m.weight.numel())  # sample but discard, to isolate the cost of rvs()
        values = torch.zeros_like(m.weight)
        with torch.no_grad():
            m.weight.copy_(values)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)
Then I still see the slowdown, and it disappears if I remove the X.rvs() call. A single call to X.rvs() doesn't take particularly long, but the loop iterates over ~300 layers, so the cost accumulates.
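For a rough sense of how that adds up, here is a minimal timing sketch (the n_weights and n_layers values are just assumptions to make it self-contained, not numbers from my actual model):

import timeit
import scipy.stats as stats

stddev = 0.1
X = stats.truncnorm(-2, 2, scale=stddev)
n_weights = 3 * 3 * 256 * 256   # assumed size of one conv weight tensor (~590k values)
n_layers = 300                  # roughly the number of Conv2d/Linear layers the loop visits

# Average the cost of a single rvs() call, then scale it by the layer count.
per_call = timeit.timeit(lambda: X.rvs(n_weights), number=10) / 10
print(f"one X.rvs() call: {per_call:.3f} s")
print(f"x {n_layers} layers: {per_call * n_layers:.1f} s")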