Network always outputs 0?

I have a network that should do the following task.
It gets an image as input and should give me 2 outputs. The first output is the coordinates where the mouse should move, the second output is whether it should click or not. At first I thought it could be overfitting. But since there are far more “click” (1) than “no click” (0) values, I am wondering why it still always outputs 0.

My code:

import torch
import torchvision
import torchvision.transforms as transforms

import os
from PIL import Image
from CustomDataset import CustomMouseDataset,Rescale

def load_data():
#    transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))])
#    #Load Recoreded Data
#    with h5py.File('data/video_data_22_7_2018_17_46','r') as data:
#        video = data['video'][()]
#        mouse = data['mouse'][()]
#    video = video[:50]
#    mouse = mouse[:50]    
    transform = transforms.Compose([transforms.ToTensor()])
    train_data = CustomMouseDataset('data/video_data_22_7_2018_17_46',transform)
    train_loader = torch.utils.data.DataLoader(train_data,batch_size=10,shuffle=True)
    return train_loader

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3, 1, 1)
        self.pool1 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(6*16*16, 20)
        self.fc2a = nn.Linear(20, 2) # Regression
        self.fc2b = nn.Linear(20, 1) # Classification
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x1 = self.fc2a(x)
        x2 = F.log_softmax(self.fc2b(x), dim=1)
        return x1, x2
def main():

if __name__ == "__main__":    
    net = Net()
    train_data = load_data()
    import torch.optim as optim
    criterion = nn.MSELoss()
    criterion2 = nn.NLLLoss()
    optimizer = optim.SGD(net.parameters(), lr = 0.001, momentum = 0.9)
    #Train the Network
    for epoch in range(10):
        running_loss = 0.0
        for i,data in enumerate(train_data,0):
            inputs = data['frame']
            labels = data['mouse']
            target_1 = labels[:,:2]
            target_2 = labels[:,2].unsqueeze(1)
    #        print(type(target_2))
    #        print(target_2.shape)
            #Zero gradients Parameter
            #forward + backward +optimize
            output1,output2 = net(inputs)
            loss1 = criterion(output1,target_1)/2000
            loss2 = criterion(output2,target_2)
            loss = loss1 + loss2
            running_loss = loss
            if i % 300 == 0:    # print every 2000 mini-batches
                print('[%d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 5))
                running_loss = 0.0
    try:           ,'Model/model_save')
    print('Finished Training')

Example Output:

tensor([[ 500.9624,  829.2684],
        [ 442.0272,  770.7896],
        [ 534.6215,  858.7812],
        [ 465.2378,  793.0269],
        [ 516.5006,  844.0679],
        [ 512.5015,  837.9688],
        [ 469.4029,  797.4636],
        [ 462.6033,  787.3254],
        [ 453.6012,  784.0760],
        [ 503.6086,  833.6633]]) tensor([[ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.],
        [ 0.]])

Your classification path has just one output, i.e. one class.
If you use F.log_softmax(x, dim=1) on it, the single prediction is normalized against itself, so the result is always log(1) = 0.

Change it to self.fc2b = nn.Linear(20, 2) for a two-class classification.

Alternatively, you could use nn.BCELoss with a single output and a sigmoid (torch.sigmoid; F.sigmoid is deprecated).
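To make this concrete, here is a minimal sketch of the problem and the two fixes. The random `features` tensor is only a stand-in for the fc1 activations, and the click labels are made up for illustration. I also show nn.BCEWithLogitsLoss, which is not mentioned above: it combines the sigmoid and nn.BCELoss in one call and works on the raw output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
features = torch.randn(4, 20)              # stand-in for the fc1 activations

# One output unit + log_softmax over dim=1: log(exp(z) / exp(z)) = log(1) = 0
single_head = nn.Linear(20, 1)
single_out = F.log_softmax(single_head(features), dim=1)
print(single_out)                          # every entry is zero, whatever the input

# Fix A: two logits, class indices (dtype long) as targets, loss on raw logits
two_class_head = nn.Linear(20, 2)
click_idx = torch.tensor([1, 0, 1, 1])     # made-up labels
loss_a = nn.CrossEntropyLoss()(two_class_head(features), click_idx)

# Fix B: one logit, float targets of shape (N, 1)
click_float = click_idx.float().unsqueeze(1)
logit = single_head(features)
loss_b_explicit = nn.BCELoss()(torch.sigmoid(logit), click_float)
loss_b = nn.BCEWithLogitsLoss()(logit, click_float)   # same value, more stable
print(loss_a.item(), loss_b.item(), loss_b_explicit.item())
```

With fix B the predicted class would be `torch.sigmoid(logit) > 0.5`; with fix A it would be `logits.argmax(dim=1)`.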

Is there a page or a way to learn which loss function, optimizer, or activation function I have to use?
I used different tutorials, but they didn't explain this point.

You can find the different loss functions in the docs.

I assume you’ve already found the PyTorch tutorials and would like to get started creating your own models.
There are a lot of good resources to learn more, e.g. Stanford’s CS231n for Visual Recognition (with free lecture videos), a course that uses a high-level wrapper built on top of PyTorch, or Andrew Ng’s Coursera course.

To begin with, you could stick to the following (this is my biased opinion and the recommendations might not be the best for your use case!):

  • For regression, try nn.MSELoss() and no non-linearity for your model output. Also normalizing the target to [0, 1] or [-1, 1] might help.
  • For classification use F.log_softmax + nn.NLLLoss or no non-linearity + nn.CrossEntropyLoss.
  • Try optim.Adam as the default optimizer.
  • Try nn.ReLU as your default non-linearity between layers.
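A small sketch of these defaults, in case it helps. The screen resolution (`SCREEN_W`, `SCREEN_H`), the sample coordinates, and the tiny nn.Linear model are assumptions for illustration only, not values from your setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Classification: both pairings give the same loss for the same raw logits
logits = torch.randn(5, 2)
classes = torch.tensor([0, 1, 1, 0, 1])
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), classes)
loss_ce = nn.CrossEntropyLoss()(logits, classes)
print(loss_nll.item(), loss_ce.item())

# Regression: scale pixel coordinates to [0, 1] before using nn.MSELoss
SCREEN_W, SCREEN_H = 1920.0, 1080.0        # assumed resolution
coords_px = torch.tensor([[500.0, 829.0], [442.0, 770.0]])
coords_norm = coords_px / torch.tensor([SCREEN_W, SCREEN_H])

# One Adam step on a stand-in regressor with no non-linearity on the output
model = nn.Linear(20, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(2, 20)
optimizer.zero_grad()                      # clear old gradients
loss = nn.MSELoss()(model(x), coords_norm)
loss.backward()                            # compute gradients
optimizer.step()                           # update the weights
```

To get pixel coordinates back at inference time, you would multiply the model output by the same scale factors.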

Once your models converge, you can refine your code and e.g. use skip connections, cyclic learning rates etc. The Deep Learning book is also a great resource.

One last question: if I now add a new layer to the network, how can I calculate the input size of the first linear layer?

Have a look at the output shape formulas for nn.Conv2d and nn.MaxPool2d in the docs.
E.g. for kernel_size=3 you will “lose” 2 pixels in height and width if you don’t pad. Adding padding=1 will keep the same shape.
MaxPool2d with kernel_size=2 and stride=2 will halve each spatial dimension.

These are just common values for these layers and you can design your model as you wish.
For the first linear layer, the number of input features is the number of channels multiplied by the height and width of the last conv/pool output.
In your example you have 6 channels and a spatial size of 16x16, i.e. 6*16*16 = 1536 input features.
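If you would rather compute it than look it up each time, the formula from the docs can be written as two small helpers. This is a plain-Python sketch: the helper names are my own, and I'm assuming a 32x32 input, which is what 6*16*16 after a single pooling step implies.

```python
def conv2d_out(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of nn.Conv2d."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def maxpool2d_out(size, kernel_size, stride=None, padding=0, dilation=1):
    """Same formula for nn.MaxPool2d (ceil_mode=False); stride defaults to kernel_size."""
    stride = kernel_size if stride is None else stride
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


no_pad = conv2d_out(32, kernel_size=3)                 # 30: two pixels lost
size = conv2d_out(32, kernel_size=3, padding=1)        # 32: padding keeps the shape
size = maxpool2d_out(size, kernel_size=2)              # 16: pooling halves it
fc1_in = 6 * size * size                               # channels * height * width
print(no_pad, size, fc1_in)                            # 30 16 1536
```

Every additional pooling step goes through `maxpool2d_out` once more, so the spatial size shrinks again.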

super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 3, 1, 1)
        self.conv2 = nn.Conv2d(6,12,3,1,1)
        self.pool1 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(12*32*32, 20)
        self.fc2a = nn.Linear(20, 2) # Regression
        self.fc2b = nn.Linear(20, 1) # Classification

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool1(x)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x1 = self.fc2a(x)
        #x2 = F.log_softmax(self.fc2b(x), dim=1)
        x2 = F.sigmoid(self.fc2b(x))
        return x1, x2

With the calculation I always get 32x32, so what's wrong there?

What is your input size? Note that you are pooling twice.

My input image is 32x32 with 1 channel (grayscale).

Since you are pooling twice, the spatial dimensions will be 32/2/2=8. Your linear layer should thus take 12*8*8 input features.

If you don’t want to calculate it, you could also print the shape of your tensor right before the view operation and just use that size.
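A minimal sketch of that approach: push a dummy 32x32 grayscale batch through the conv/pool part of your model and read off the shape. The layers are copied from your snippet; the dummy tensor and the `n_features` variable are just for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 6, 3, 1, 1)
conv2 = nn.Conv2d(6, 12, 3, 1, 1)
pool = nn.MaxPool2d(2)

with torch.no_grad():
    x = torch.zeros(1, 1, 32, 32)          # dummy batch: one 32x32 grayscale image
    x = pool(F.relu(conv1(x)))
    x = pool(F.relu(conv2(x)))

print(x.shape)                             # torch.Size([1, 12, 8, 8])
n_features = x.view(x.size(0), -1).size(1) # 12 * 8 * 8 = 768
fc1 = nn.Linear(n_features, 20)            # sized from the measured shape
```

This way the linear layer stays correct if you later change the number of conv or pooling layers.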