Why is my initial loss bigger than expected?

I am trying to perform a simple binary classification with a neural network on the make_moons dataset.

Because of random initialization, I expect the initial predictions to be split evenly between correct and incorrect, with a 50% chance each. This would give a cross-entropy loss of -ln(0.5) = ln(2) ≈ 0.693, but my initial loss is 1.684.
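
As a sanity check on that expected value, here is a minimal framework-free sketch. The helper names `softmax` and `cross_entropy` and the example logits are made up for illustration. It shows that logits of exactly zero give ln(2), and that confident logits paired with 50/50 labels give an average loss above ln(2), so ln(2) is a lower bound for an uninformed model rather than a guaranteed starting value.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -math.log(probs[target])

# Uniform prediction over two classes: the loss is ln(2) for either label.
uniform_loss = cross_entropy(softmax([0.0, 0.0]), target=1)
print(round(uniform_loss, 4))  # 0.6931

# Confident but uninformed logits (hypothetical values), labels split 50/50.
probs = softmax([2.0, -1.0])
avg_loss = 0.5 * (cross_entropy(probs, 0) + cross_entropy(probs, 1))
print(avg_loss > math.log(2))  # True
```

So an initial loss above 0.693 is possible when the untrained network produces large logits, which is one reason the scale of the initial weights matters.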

What could it be?

I am using a simple two-layer PyTorch network:

import torch
import torch.nn as nn

class TorchNet(nn.Module):
    def __init__(self, inp: int, out: int, hid: int, n_layers: int,
                 actf: str = 'relu', recursive: int = 0):
        super().__init__()
        opt = ['relu', 'sigm', 'tanh']
        err = 'Select a correct activation function from: {}'.format(opt)
        assert actf in opt, err
        self.n_lay = n_layers
        # Number of extra passes through the last hidden layer (0 disables them)
        self.recursive = recursive
        self.fcInp = nn.Linear(inp, hid, bias=False)
        self.fcHid = nn.ModuleList([nn.Linear(hid, hid, bias=False) for _ in range(self.n_lay)])
        self.fcOut = nn.Linear(hid, out, bias=False)
        if actf == 'relu': self.actf = nn.ReLU(inplace=True)
        if actf == 'sigm': self.actf = nn.Sigmoid()
        if actf == 'tanh': self.actf = nn.Tanh()

    def forward(self, x):
        # Input layer
        x = self.actf(self.fcInp(x))

        # Hidden layers
        for l in range(self.n_lay):
            x = self.actf(self.fcHid[l](x))
            # Re-apply the last hidden layer `recursive` more times
            if l == self.n_lay - 1 and self.recursive:
                for _ in range(self.recursive):
                    x = self.actf(self.fcHid[l](x))

        # Output layer (raw logits; nn.CrossEntropyLoss applies log-softmax itself)
        x = self.fcOut(x)
        return x

Could you check whether all predictions are biased towards one specific class?
I tried your model with some dummy inputs and got a loss close to the expected value:

model = TorchNet(2, 2, 2, 2)
x = torch.randn(100, 2)
target = torch.randint(0, 2, (100,))
criterion = nn.CrossEntropyLoss()

output = model(x)
loss = criterion(output, target)
> tensor(0.6932, grad_fn=<NllLossBackward>)

If your loss is higher, you might want to check your initializations.
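
The bias check suggested above can be sketched without any framework as a count of how often each class wins the argmax. The function name `predicted_class_counts` and the logits below are hypothetical, for illustration only.

```python
from collections import Counter

def predicted_class_counts(logits):
    """Count how often each class index has the largest logit in a row."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return Counter(preds)

# Hypothetical logits for six samples and two classes.
example_logits = [
    [0.3, -0.1],
    [1.2, 0.4],
    [0.9, 0.8],
    [0.1, 0.0],
    [-0.2, 0.5],
    [2.0, -1.0],
]
counts = predicted_class_counts(example_logits)
print(counts)  # class 0 wins five times, class 1 once
```

With the model output as a tensor, `torch.bincount(output.argmax(dim=1), minlength=2)` should give the same kind of count. A heavily lopsided result at initialization would point to biased predictions.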

I think you are right: there must be something wrong with the initializations, or maybe something else I cannot figure out. To get more insight I ran some experiments:

These are the models (3 hidden layers of width 10):
Code of the models is here

modelN = TorchNet('No Activation', inp_dim, n_class, lay_size, n_layers, actf='none', track_stats=True, recursive=0)
modelS = TorchNet('Sigmoid', inp_dim, n_class, lay_size, n_layers, actf='sigm', track_stats=True, recursive=0)
modelT = TorchNet('TanH', inp_dim, n_class, lay_size, n_layers, actf='tanh', track_stats=True, recursive=0)
modelR = TorchNet('ReLU', inp_dim, n_class, lay_size, n_layers, actf='relu', track_stats=True, recursive=0)

The results change between runs, so I guess they are very sensitive to the initialization?
How can I initialize the models properly, so that I can be sure the problem is somewhere else? I haven't coded any specific initialization manually, so I assume the default random one is used.

I also leave here the code for training:

def train_epoch(model, tr_loader, criterion, optimizer, lr, results):
    train_loss = 0
    correct, total = 0, 0
    # Run minibatches from the training dataset
    for i, (X, labels) in enumerate(tr_loader):
        # Forward pass
        y_pred = model(X)
        _, preds = torch.max(y_pred.detach(), 1)
        # Compute loss
        loss = criterion(y_pred, labels)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Collect stats
        train_loss += loss.item()

        # Compute and store epoch results
        total += y_pred.size(0)
        correct += (preds == labels).sum().item()
    # Mean loss over the i + 1 batches; `train_loss / i+1` parses as (train_loss / i) + 1
    lss = round(train_loss / (i + 1), 3)
    acc = round((correct / total) * 100, 2)
    return lss, acc

def valid_epoch(model, ts_loader, criterion, results):
    valid_loss = 0
    correct, total = 0, 0
    with torch.no_grad():
        for i, (X, labels) in enumerate(ts_loader):
            # Forward pass
            y_pred = model(X)
            _, preds = torch.max(y_pred, 1)
            # Compute loss
            loss = criterion(y_pred, labels)
            valid_loss += loss.item()
            # Compute and store epoch results
            total += y_pred.size(0)
            correct += (preds == labels).sum().item()
    # Mean loss over the i + 1 batches; `valid_loss/i+1` parses as (valid_loss / i) + 1
    lss = round(valid_loss / (i + 1), 3)
    acc = round((correct / total) * 100, 3)
    return lss, acc
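
One thing worth checking whenever a summed loss is averaged with the index from `enumerate`: in Python, `total / i+1` is parsed as `(total / i) + 1`, not as `total / (i + 1)`. A minimal sketch with made-up per-batch losses (the values are hypothetical, chosen to sit near ln(2)):

```python
batch_losses = [0.70, 0.69, 0.68, 0.69]  # hypothetical per-batch losses

total = 0.0
for i, loss in enumerate(batch_losses):
    total += loss

# Division binds tighter than addition, so this is (total / i) + 1.
shifted = round(total / i+1, 3)
# Mean over the i + 1 batches that were actually seen.
mean = round(total / (i + 1), 3)

print(shifted, mean)  # 1.92 0.69
```

With many batches the first form adds roughly 1 to the mean, so a loss near ln(2) would be reported as about 1.7, which is close to the 1.684 mentioned in the question. This is a possibility worth ruling out, not a confirmed diagnosis.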

Then, from the main.py:

models += [modelN, modelS, modelT, modelR]

for model in models:
    r = Results()
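    # Keyword arguments: the fourth positional parameter of optim.SGD is dampening, not weight_decay
    optimizer = optim.SGD(model.parameters(), lr=LR, momentum=MOMEMTUM, weight_decay=WEIGHT_DECAY, nesterov=NESTEROV)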
    model_no_recursive_params = [model, criterion, optimizer, r]
    train_no_recursive_params = [EPOCHS, LR]
    train(*model_no_recursive_params, *train_no_recursive_params)

You could adapt this initialization using your non-linearities:

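def weight_init(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight, gain=nn.init.calculate_gain('relu'))
        if m.bias is not None:
            # Assumption: zero-initialize the bias (the snippet is truncated at this point)
            nn.init.zeros_(m.bias)

model.apply(weight_init)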
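
To make "adapt this to your non-linearities" concrete, here is a rough sketch of the range Xavier uniform initialization samples from, U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). The helper `xavier_uniform_bound` is a made-up name, and the gains are the values `nn.init.calculate_gain` is documented to return. Note that `calculate_gain` expects 'sigmoid', whereas the model above uses the key 'sigm'.

```python
import math

# Gains corresponding to nn.init.calculate_gain for these non-linearities.
GAINS = {"relu": math.sqrt(2.0), "tanh": 5.0 / 3.0, "sigmoid": 1.0}

def xavier_uniform_bound(fan_in, fan_out, nonlinearity):
    """Half-width a of the U(-a, a) range used by Xavier uniform init."""
    return GAINS[nonlinearity] * math.sqrt(6.0 / (fan_in + fan_out))

# Hidden layers of width 10, as in the experiments above.
bounds = {name: xavier_uniform_bound(10, 10, name) for name in GAINS}
for name, bound in bounds.items():
    print(name, round(bound, 4))
```

The gain only rescales the sampling range, so switching the activation changes how large the initial weights, and therefore the initial logits, can be.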

In most runs, though not all, the model's output was randomly distributed.