Criterion issues on MNIST using CNN

We are currently implementing a WaveNet-like structure to solve the problem of MNIST prediction, i.e. given some part of an image, reconstruct the rest. As a starting point we've trained on the whole MNIST data set, but we run into a problem with the model's output estimate, namely that it comes out all gray.

From what we gather from the PyTorch documentation we want to use binary cross entropy, and our model applies a softmax to its output. Does F.binary_cross_entropy apply a softmax internally, like nn.CrossEntropyLoss does? We also tried applying a sigmoid instead, since nn.BCELoss seemed to expect that, but to no avail.

No, neither nn.BCELoss nor nn.CrossEntropyLoss expects an output with a softmax applied to it.
nn.BCEWithLogitsLoss and nn.CrossEntropyLoss both expect raw logits (no activation) and are used for binary or multi-label classification, and multi-class classification, respectively.

nn.BCELoss expects probabilities (sigmoid applied on the output) and is numerically less stable than nn.BCEWithLogitsLoss.
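
To make the difference concrete, here is a minimal sketch (the tensor shapes are just made up for illustration):

import torch
import torch.nn as nn

# binary / multi-label case: target is a float tensor of 0s and 1s
logits = torch.randn(8, 1, 783)                    # raw model outputs, no activation
target = torch.randint(0, 2, (8, 1, 783)).float()

loss_with_logits = nn.BCEWithLogitsLoss()(logits, target)     # sigmoid applied internally
loss_probs = nn.BCELoss()(torch.sigmoid(logits), target)      # you apply the sigmoid yourself

# multi-class case: raw logits over C classes, integer class indices as target
ce_logits = torch.randn(8, 10)
ce_target = torch.randint(0, 10, (8,))
loss_ce = nn.CrossEntropyLoss()(ce_logits, ce_target)         # log-softmax applied internally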

Thank you for the quick reply. Alright, the activation might have been the culprit. If we may take another minute of your time, does the following make sense with regard to predicting with F.binary_cross_entropy?

A quick preamble on our workflow:
We normalized the data with the mean and std, and we encoded the pixels of each image to 0s and 1s. To make it fit into WaveNet, we flattened each image into a vector, since we use Conv1d layers with causal dilation. Finally, training and prediction are based on the following snippet of code:

import torch
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm

MNIST_ = load_mnist(path=path, train=True)

# use the GPU if one is available
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print("Training using", device)

model = Wavenet(layers=3, blocks=2, output_size=1, output_channels=1).to(device)
model.train()


optimizer = optim.Adam(model.parameters(), lr=0.0003)
epochs = 5


# for epoch in range(epochs):
for i, batch in tqdm(enumerate(MNIST_)):
    Xtrain, ytrain = batch
    batch_size = Xtrain.shape[0]
    Xtrain = Xtrain.reshape(batch_size, 1, 784)  # flatten each image to a 1D sequence
    # Xtrain = Xtrain.transpose(1, 2)

    # encoded_MNIST = encode_MNIST(Xtrain)

    # next-pixel prediction: the input is the sequence shifted by one w.r.t. the target
    inputs = Xtrain[:, :, :-1].to(device)
    target = Xtrain[:, :, 1:].to(device)
    # inputs = encoded_MNIST[:, :, :-1].to(device)
    # target = encoded_MNIST[:, :, 1:].to(device)

    output = model(inputs)

    loss = F.binary_cross_entropy(output, target)

    print("\nLoss:", loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if i % 1000 == 0:
        print("\nSaving model")
        torch.save(model.state_dict(), "wavenet_MNIST.pt")
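
For completeness, the pixel encoding we described above looks roughly like this (the 0.5 threshold and shapes are only for illustration, and the mean/std normalization is left out):

import torch

x = torch.rand(16, 1, 28, 28)                     # a placeholder batch of MNIST images in [0, 1]

x_bin = (x > 0.5).float()                         # encode each pixel to 0 or 1
x_flat = x_bin.reshape(x_bin.shape[0], 1, 784)    # flatten to (batch, 1, 784) for Conv1d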

EDIT: Our advisor said that a sliding window could be used this way, even though we might have gotten the implementation wrong. Essentially, he said that sliding a window i times over the remainder of the vector would be equivalent to simply trimming a window_size off each end.

F.binary_cross_entropy is the functional API version of nn.BCELoss, so your model would need to apply torch.sigmoid to its outputs before passing them to this criterion. Also, I assume that you are working on a binary or multi-label classification.
If so, I would recommend removing the sigmoid and using F.binary_cross_entropy_with_logits(input, target) on the raw outputs instead.
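
E.g., something along these lines in your training loop (just a sketch, assuming your Wavenet currently ends with a sigmoid that you would remove):

# the model now returns raw logits (no sigmoid in the forward pass)
output = model(inputs)
loss = F.binary_cross_entropy_with_logits(output, target)

# at inference time, apply the sigmoid explicitly to get pixel probabilities
with torch.no_grad():
    probs = torch.sigmoid(model(inputs))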
