I have been investigating how the model's performance is affected if the weights of the first layer are initialized to be very large and the weights of the second layer very small. It seems the model cannot compute useful gradients and fails to converge. Why is this the case, given that the first and second layers' weights could in principle compensate for each other?
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.conv1 = nn.Conv1d(141, 100, 5)
        self.conv2 = nn.Conv1d(100, 1, 1)
        self.relu = nn.ReLU()

    def forward(self, X):
        X = X.transpose(2, 1)   # (batch, length, channels) -> (batch, channels, length)
        X = self.conv1(X)
        X = self.relu(X)
        X = self.conv2(X)
        return X.transpose(2, 1)

model = Net()
# scale the initial weights: first layer very large, second layer very small
model.conv1.weight.data *= 1000
model.conv1.bias.data *= 1000
model.conv2.weight.data /= 1000
# start training here
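For reference, here is a minimal sketch of how I checked the per-layer gradient scales after this initialization. The input shape, target, and MSE loss are assumptions for illustration, since the real training data isn't shown:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

conv1 = nn.Conv1d(141, 100, 5)
conv2 = nn.Conv1d(100, 1, 1)

# apply the same scaling as in the model above
with torch.no_grad():
    conv1.weight *= 1000
    conv1.bias *= 1000
    conv2.weight /= 1000

x = torch.randn(4, 141, 50)   # assumed (batch, channels, length) input
y = torch.randn(4, 1, 46)     # kernel size 5 shrinks the length by 4

out = conv2(F.relu(conv1(x)))
loss = F.mse_loss(out, y)     # assumed loss, just for illustration
loss.backward()

# relative gradient size per layer: ||grad|| / ||weight||
for name, p in [("conv1", conv1.weight), ("conv2", conv2.weight)]:
    print(name, (p.grad.norm() / p.norm()).item())
```

The ratio for conv2 comes out many orders of magnitude larger than for conv1, which is the mismatch I am asking about.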