PyTorch/TensorFlow: impact of xavier_uniform and kaiming_uniform weight initialization

Hi, I am using the following model to train my network:

import torch as th
import torch.nn as nn

class FemnistNet(nn.Module):
    def __init__(self):
        super(FemnistNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2)  ## output shape (batch, 32, 28, 28)
        self.pool1 = nn.MaxPool2d(2, stride=2)  ## output shape (batch, 32, 14, 14)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2)  ## output shape (batch, 64, 14, 14)
        self.pool2 = nn.MaxPool2d(2, stride=2)  ## output shape (batch, 64, 7, 7)
        
        self.fc1 = nn.Linear(3136, 2048)  ## 64 * 7 * 7 = 3136 inputs after flattening
        self.fc2 = nn.Linear(2048, 62)
        
    def forward(self, x):
        x = x.view(-1, 1, 28, 28)
        x = self.conv1(x)
        x = th.nn.functional.relu(x)

        x = self.pool1(x)

        x = self.conv2(x)
        x = th.nn.functional.relu(x)
        
        x = self.pool2(x)
        
        x = x.flatten(start_dim=1)
        
        x = self.fc1(x)
        l1_activations = th.nn.functional.relu(x)
        
        x = self.fc2(l1_activations)

        x = x.softmax(dim=1)  ## softmax over the 62 class logits; dim is required

        return x, l1_activations

The default weight initialization for nn.Linear and nn.Conv2d in PyTorch is kaiming_uniform, and the model trains well with it. But when I initialize the weights with Xavier instead, e.g. th.nn.init.xavier_uniform_(self.fc1.weight), the parameters of the dense/linear layers become NaN during training. What is the impact of the weight initialization distribution? Why do the weights become NaN with th.nn.init.xavier_uniform_(self.fc1.weight)?
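
For reference, here is a minimal sketch of how I apply the Xavier initialization. Only the xavier_uniform_ call on fc1 is what I described above; extending it to fc2 and zeroing the biases are illustrative assumptions:

model = FemnistNet()

## Re-initialize the dense/linear layers with Xavier uniform.
## Zeroing the biases is just an assumed companion choice, not
## something the question depends on.
for layer in (model.fc1, model.fc2):
    th.nn.init.xavier_uniform_(layer.weight)
    th.nn.init.zeros_(layer.bias)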

Different initialization distributions work fine in TensorFlow: I don't see NaNs there with an equivalent model.
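
For comparison, a minimal sketch of the equivalent Keras model I have in mind (the layer details are assumptions, since my actual TensorFlow code is not pasted above; glorot_uniform is Keras's name for Xavier uniform and is also its default kernel initializer):

import tensorflow as tf

## Sketch of an equivalent Keras model; glorot_uniform == Xavier uniform
## and is the default for Dense/Conv2D kernels in Keras.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2048, activation="relu",
                          kernel_initializer="glorot_uniform"),
    tf.keras.layers.Dense(62, activation="softmax",
                          kernel_initializer="glorot_uniform"),
])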