FashionMNIST CNN architecture: train loss shows nan

We decided to implement a CNN architecture that looks like this:

# implement the MNISTNet network architecture
class FashionMNISTNet(nn.Module):

    # define the class constructor
    def __init__(self):

        # call super class constructor
        super(FashionMNISTNet, self).__init__()

        # specify convolution layer 1
        self.layer_1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )

        # specify convolution layer 2
        self.layer_2 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        self.linear1 = nn.Linear(64*6*6, 600)

        self.drop = nn.Dropout2d(0.25)

        self.linear2 = nn.Linear(600, 120)

        self.linear3 = nn.Linear(120, 10)

        # add a softmax to the last layer
        # self.logsoftmax = nn.LogSoftmax(dim=1) # the softmax

    # define network forward pass
    def forward(self, images):

        x = self.layer_1(images)

        x = self.layer_2(x)

        x = x.view(x.size(0), -1)

        x = self.linear1(x)

        x = self.drop(x)

        x = self.linear2(x)

        x = self.linear3(x)

        # define layer 3 forward pass
        # x = self.logsoftmax(self.linear3(x))

        # return forward pass result
        return x
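
As a side note for readers, the `64*6*6` input size of `linear1` can be checked by tracing a 28x28 Fashion-MNIST image through both blocks. The sketch below uses the usual output-size formula for convolution and pooling layers, floor((n + 2*padding - kernel) / stride) + 1, assuming dilation 1. The helper name `out_size` is made up for illustration and is not part of PyTorch.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (dilation 1, floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1


size = 28                                   # Fashion-MNIST images are 28x28
size = out_size(size, kernel=3, padding=1)  # layer_1 convolution
size = out_size(size, kernel=2, stride=2)   # layer_1 max-pooling
size = out_size(size, kernel=3)             # layer_2 convolution (no padding)
size = out_size(size, kernel=2, stride=2)   # layer_2 max-pooling

# layer_2 produces 64 channels, so the flattened vector has 64 * size * size entries
flat_features = 64 * size * size
print(size, flat_features)
```

If the kernel sizes, padding, or input resolution change, the first argument of `nn.Linear` has to be recomputed in the same way.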

When we run the network training with the same code (apart from setting mini_batch_size = 100), every train_epoch_loss is displayed as “nan”:

and the error shows this:

Does anyone know how to fix this error?

Thank you very much!

You’ll need to reshape/unsqueeze your inputs to [10000, 1, 28, 28]. You need to do this because the network expects inputs of shape [_, channel, height, width].

Thanks! On which line should I implement this?

You could do it in your forward function as images=images.unsqueeze(dim=1). However, it would be better if you include this step in your data preprocessing pipeline.
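
To make the suggestion concrete, here is a minimal sketch of adding the channel dimension, assuming the images arrive as a float tensor of shape [N, 28, 28]. The random tensor `images` only stands in for real data, and the batch size of 100 is taken from the question. The convolution is the first layer from the question's network.

```python
import torch
import torch.nn as nn

# stand-in for a mini-batch of grayscale images without a channel axis (assumption)
images = torch.rand(100, 28, 28)

# insert a channel axis at position 1: [100, 28, 28] -> [100, 1, 28, 28]
images_4d = images.unsqueeze(dim=1)

# first convolution from the question; Conv2d expects [N, C, H, W] input
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)
features = conv(images_4d)

print(images_4d.shape)
print(features.shape)
```

It is worth printing `images.shape` before adding the axis. If the data is loaded through torchvision with `transforms.ToTensor()`, the channel axis is normally already present, and a second `unsqueeze` would produce a 5-dimensional tensor that `Conv2d` rejects.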

You mean here:

It does not seem to work; could you be more precise?
I am fairly new to the coding world. :smiley:

def forward(self, images):
    images=images.unsqueeze(dim=1)
    x = self.layer_1(images)
    x = self.layer_2(x)
    x = x.view(x.size(0), -1)
    .
    .
    .

You need to perform the step before you begin passing it into the network.

Thanks, now this works:

# define network forward pass
def forward(self, images):

    images = images.unsqueeze(dim=1)

    x = self.layer_1(images)
    x = self.layer_2(x)
    x = x.view(x.size(0), -1)

    x = self.linear1(x)
    x = self.drop(x)
    x = self.linear2(x)
    x = self.linear3(x)

    # define layer 3 forward pass
    # x = self.logsoftmax(self.linear3(x))

    # return forward pass result
    return x

But it still leads to an issue here:

# init collection of training epoch losses
train_epoch_losses = []

# set the model in training mode
model.train()

# train the MNISTNet model
for epoch in range(num_epochs):

    # init collection of mini-batch losses
    train_mini_batch_losses = []

    # iterate over all mini-batches
    for i, (images, labels) in enumerate(fashion_mnist_train_dataloader):

        # push mini-batch data to computation device
        images = images.to(device)
        labels = labels.to(device)

        # run forward pass through the network
        outputs = model(images)

        # reset graph gradients
        model.zero_grad()

        # determine classification loss
        loss = nll_loss(outputs, labels)

        # run backward pass
        loss.backward()

        # update network parameters
        optimizer.step()

        # collect mini-batch reconstruction loss
        train_mini_batch_losses.append(loss.data.item())

    # determine mean mini-batch loss of epoch
    train_epoch_loss = np.mean(train_mini_batch_losses)

    # print epoch loss
    now = datetime.utcnow().strftime("%Y%m%d-%H:%M:%S")
    print('[LOG {}] epoch: {} train-loss: {}'.format(str(now), str(epoch), str(train_epoch_loss)))

    # set filename of actual model
    model_name = 'fashion_mnist_model_epoch_{}.pth'.format(str(epoch))

    # save current model to GDrive models directory
    torch.save(model.state_dict(), os.path.join(models_directory, model_name))

    # collect mean mini-batch loss of epoch
    train_epoch_losses.append(train_epoch_loss)

with this error message: