Softmax returns only 1s and 0s during inference

Hi. First off, thanks for all the work you people constantly put in. I appreciate it.

I am currently training a relatively simple CNN for a classification task (75 classes, a few thousand training samples). The network itself trains nicely, and testing gives me reasonable accuracy (around 60-70%, depending on a few hyperparameters and loss functions). However, I ran into a problem with inference when I tried to include the model in a pipeline.

Forward-pass code so you know the network:

def forward(self, x):
    # noise and rotation layers
    x = self.noise(x)
    x = self.rotation(x)
    # conv block 1
    x = self.conv1(x)
    if self.bn:
        x = self.BN1(x)
    x = nnf.relu(x)
    x = self.dropout_conv(x)
    x = nnf.avg_pool2d(x, kernel_size=4)
    # conv block 2
    x = self.conv2(x)
    if self.bn:
        x = self.BN2(x)
    x = nnf.relu(x)
    x = nnf.max_pool2d(x, kernel_size=2)
    # conv block 3
    x = self.conv3(x)
    if self.bn:
        x = self.BN3(x)
    x = nnf.relu(x)
    x = nnf.max_pool2d(x, kernel_size=2)
    # conv block 4
    x = self.conv4(x)
    if self.bn:
        x = self.BN4(x)
    x = nnf.relu(x)
    x = nnf.max_pool2d(x, kernel_size=2)
    # flatten and fully connected head
    x = x.view(-1, self.n_feature*16*16*24)
    x = self.dropout_fc(x)
    x = self.fc1(x)
    x = nnf.relu(x)
    x = self.dropout_fc(x)
    x = self.fc2(x)
    x = nnf.relu(x)
    x = self.dropout_fc(x)
    x = self.fc3(x)
    # raw logits; softmax is applied outside the model
    return x

BN layers are BatchNorm; FC layers are fully connected.

Probability estimation during testing in my original notebook:

with torch.no_grad():
    for data, target, dindex in test_loader:
        output = model(data)
        # temperature-scaled (log-)softmax for calibrated probabilities
        lsm = nnf.log_softmax(output/model.temperature, dim=1).to(output.device)
        sm = nnf.softmax(output/model.temperature, dim=1).to(output.device)
        test_loss += nnf.nll_loss(lsm, target.to(output.device), reduction='sum').item()
        # predicted class index and its probability
        pred = lsm.data.max(1, keepdim=True)[1]
        prob = sm.data.max(1, keepdim=True)[0]
...
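(For anyone reading along: model.temperature is plain temperature scaling for calibration, i.e. the logits are divided by a scalar T before the softmax; T > 1 flattens the probabilities, T < 1 sharpens them. A tiny self-contained illustration with made-up logits:)

import torch
import torch.nn.functional as nnf

logits = torch.tensor([[2.0, 0.5, -1.0]])      # made-up logits for one sample
for T in (0.5, 1.0, 2.0):
    print(T, nnf.softmax(logits / T, dim=1))   # higher T -> flatter distribution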

Inference in the pipeline (currently another notebook):

...
    representation = df.make_representation_from_unknown(current_image = sitk_image, target_size=(512,512,512))
    # add batch dimension to image
    tensor_representation = torch.unsqueeze(torch.Tensor(representation), 0)
    
    with torch.no_grad():
        # load network
        network = torch.load(network)
        # set to eval mode
        network.eval()
        # collect results
        logits = network(tensor_representation)
        if verbose:
            print(logits)
        lsm = torch.nn.functional.log_softmax(logits/network.temperature, dim=1)
        sm = torch.nn.functional.softmax(logits/network.temperature, dim=1)
        prediction = lsm.data.max(1, keepdim=True)[1].item()
        probability = sm.data.max(1, keepdim=True)[0].item()
    ...

The original code produces sensible probabilities just fine. The ported version does not (it only ever outputs a single 1 and 0s for every other class). Since the entire model should be loaded as-is, the only difference I can see is the batch size: the test_loader uses a batch size of 24, while the pipeline has to make single predictions. My only guess so far is BatchNorm acting up because of the change in batch size.
Is that intuition correct? And if so, how do I solve it?
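For what it's worth, here is roughly how one could test that intuition: in eval() mode, BatchNorm uses its running statistics instead of batch statistics, so single-sample and batched forward passes should agree. A rough sketch, assuming test_batch is one batch of images pulled from test_loader and that the noise/rotation layers do nothing in eval mode:

model.eval()
with torch.no_grad():
    batched = model(test_batch)                                           # one pass over the whole batch
    single = torch.cat([model(img.unsqueeze(0)) for img in test_batch])   # one image at a time
# if this prints True, BatchNorm/batch size is not the culprit
print(torch.allclose(batched, single, atol=1e-5))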

In the meantime, I have narrowed down the problem somewhat. The data input looks the same, and the issue (or, more precisely, the weird probability output) is not tied to the batch size. For some reason, the raw logits are just several orders of magnitude larger (around -15 in the notebook vs. around -10000 in the pipeline for most cases) when I load the network somewhere else for inference.
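That magnitude difference by itself explains the 1s and 0s: once the logits are thousands apart, the softmax saturates to exactly 1 and 0 in floating point, independent of batch size. A small demonstration with made-up values in the ranges I was seeing:

import torch
import torch.nn.functional as nnf

sane = torch.tensor([[-15.0, -16.0, -17.5]])          # notebook-scale logits
broken = torch.tensor([[-15.0, -10000.0, -10000.0]])  # pipeline-scale logits
print(nnf.softmax(sane, dim=1))     # graded probabilities, roughly 0.69 / 0.25 / 0.06
print(nnf.softmax(broken, dim=1))   # tensor([[1., 0., 0.]]) -- fully saturated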

Found the issue. The data was not the same after all (the pipeline was missing the normalization step, and I didn’t notice).
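For completeness, the fix was just applying the same normalization in the pipeline that the data loaders apply in the notebook, before the forward pass. A rough sketch (the mean/std values here are placeholders; the real ones come from the training data):

# normalize the representation exactly like the training/test transforms do
mean, std = 0.5, 0.25   # placeholder statistics
tensor_representation = (tensor_representation - mean) / std

with torch.no_grad():
    logits = network(tensor_representation)   # logits are back in a sensible range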

Let this be a lesson to anyone getting weird logits out of their network: print the values, don't plot the image. :v)