I am using MSE loss to regress values, and for some reason I get NaN outputs almost immediately. The first input always comes through unscathed, but after that the loss quickly goes to infinity and the prediction comes out as a matrix of NaNs. Why might this be happening?

I’ve checked my inputs and GT, and those values are correct and not all zeros. My training loop is below:

```
import random

import torch
import torch.nn as nn

# MSE loss (size_average=True averages the squared error over all elements)
c_label_criterion = nn.MSELoss(size_average=True).cuda()
# Instantiate optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
# Run all epochs
for ep in xrange(num_epochs):
    # Shuffle the indices
    random.shuffle(indices)
    # Run through the dataset completely
    for i in indices:
        # Load batch of data
        img, c_gt = dataset.next_input()  # B x 3 x M x N image, B x 2 x M x N ground truth
        # Put the data on the GPU
        img = img.cuda()
        #------------------------------------------------------------------
        # Run a forward pass
        #------------------------------------------------------------------
        # Zero out the gradients
        optimizer.zero_grad()
        # Forward pass
        c_pred = model(img)
        #------------------------------------------------------------------
        # Backpropagate
        #------------------------------------------------------------------
        # Compute component losses
        c_loss = c_label_criterion(c_pred, c_gt)
        # Perform backpropagation
        c_loss.backward()
        # Update the weights
        optimizer.step()
```
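For reference, here is the kind of per-iteration logging I could drop in after `c_loss.backward()` to pinpoint exactly where values blow up. This is just a sketch; `c_loss`, `model`, and `i` are the objects from the loop above, and the NaN check relies on NaN comparing unequal to itself:

```
# Sketch of a diagnostic, placed between c_loss.backward() and optimizer.step()
loss_val = c_loss.data[0]  # scalar loss value
# Global L2 norm over all parameter gradients
grad_norm = sum(p.grad.data.norm() ** 2
                for p in model.parameters() if p.grad is not None) ** 0.5
print('iter %d: loss=%.4f, grad norm=%.4f' % (i, loss_val, grad_norm))
if loss_val != loss_val:  # NaN is the only value not equal to itself
    raise RuntimeError('Loss went NaN at iteration %d' % i)
```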

I initialize my model weights (I’m training from scratch) with the following:

```
import math

import torch.nn as nn

# Initialize conv weights with Gaussian random values
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
        m.weight.data.normal_(0, math.sqrt(2. / n))
```
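If I understand correctly, this is the He/Kaiming "fan-out" scheme, so it should be equivalent to the built-in initializer, something like the following sketch using `torch.nn.init` (which I believe computes the same standard deviation, since fan-out for a Conv2d is `kernel_size[0] * kernel_size[1] * out_channels`):

```
import torch.nn as nn
import torch.nn.init as init

for m in model.modules():
    if isinstance(m, nn.Conv2d):
        # fan_out mode matches the manual n = kh * kw * out_channels above
        init.kaiming_normal(m.weight, mode='fan_out')
```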

My architecture is basically FCN with some minor alterations, and I don’t perform batch normalization anywhere in the network. My GT data is in the range [0, 1], but the predictions after the first input are on the order of hundreds. For example, after one input, the MSE loss value (with `size_average=True`) is 1022004.8750. I thought perhaps I was dealing with exploding gradients, but I have ReLUs throughout; just in case, I sketch a gradient-clipping idea after the block below. Here is an example block:

```
self.conv_block4 = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1),  # 256 in, 256 out
    nn.ReLU(inplace=True),              # ReLU nonlinearity
    nn.Conv2d(256, 256, 3, padding=1),  # 256 in, 256 out
    nn.ReLU(inplace=True),              # ReLU nonlinearity
    nn.Conv2d(256, 256, 3, padding=1),  # 256 in, 256 out
    nn.ReLU(inplace=True),              # ReLU nonlinearity
    nn.MaxPool2d(2, stride=2, ceil_mode=True)
)
```
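If it does turn out to be exploding gradients, I suppose I could clip the global gradient norm before the optimizer step, along these lines (a sketch; the max-norm value of 1.0 is an arbitrary choice on my part):

```
from torch.nn.utils import clip_grad_norm

# ... inside the inner training loop:
c_loss.backward()
clip_grad_norm(model.parameters(), 1.0)  # rescale gradients so their global norm <= 1.0
optimizer.step()
```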

Could anyone shed some light on what may be occurring?