Why is my model predicting the same value over and over, and why is my loss negative?

Hello! I’m a total noob at machine learning and have stumbled upon an issue with a model I’m training to recognize note patterns in MIDI files.
I managed to run the model on my note data, but my loss came back negative for all the epochs:

tensor(-0.6253, grad_fn=)
tensor(-0.6354, grad_fn=)
tensor(-0.6475, grad_fn=)

and my output from trying to generate a new melody from a seed of three values stagnates at one value:

[30.0, 63.0, 68.0, 82.4828075170517, 82.22588127851486, 82.24921894073486, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509, 82.24920380115509]

Why is my loss negative, and why does the prediction stagnate at a single value so fast?

import os
from mido import MidiFile
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

class MidiFileCollection(object):
    def __init__(self, path):
        self.path = path
        self.files = []
        for filename in os.listdir(path):
            self.files.append(MidiFile(path+filename))

    def longestFile(self):
        longest = 0
        for index, file in enumerate(self.files):
            if index == 0:
                longest = file
            else:
                if file.getLength() > longest.getLength():
                    longest = file
        return longest
    def shortestFile(self):
        shortest = 1000000
        for index, file in enumerate(self.files):
            if index == 0:
                shortest = file
            else:
                if file.getLength() < shortest.getLength():
                    shortest = file
        return shortest
    def mostEvents(self):
        busiest = 0
        tempCount = 0
        for idx, file in enumerate(self.files):
            if idx == 0:
                busiest = file
            else:
                if len(file.notes()) > len(busiest.notes()):
                    busiest = file
        return busiest
    def get(self, index):
        return self.files[index]

path = "pathToMidiFile"
collection = MidiFileCollection(path)

def interpolate(value, leftMin, leftMax, rightMin, rightMax):
    # Figure out how 'wide' each range is
    leftSpan = leftMax - leftMin
    rightSpan = rightMax - rightMin

    # Convert the left range into a 0-1 range (float)
    valueScaled = float(value - leftMin) / float(leftSpan)

    # Convert the 0-1 range into a value in the right range.
    return rightMin + (valueScaled * rightSpan)

class NotesDataSet(Dataset):
    def __init__(self, midifile):
        self.midifile = midifile
        self.train = []
        self.labels = []
        notes = self.midifile.notes()

        for index, note in enumerate(range(len(notes)-3)):
            self.train.append([notes[index],notes[index+1],notes[index+2]])
            self.labels.append(notes[index+3])

    def __len__(self):
        return len(self.train)

    def __getitem__(self, idx):
        current_sample = self.train[idx]
        current_label = self.labels[idx]
        return {
            "sample": torch.tensor(current_sample),
            "label": torch.tensor(interpolate(current_label, 0, 127, 0, 1))
        }

notedata = NotesDataSet(collection.mostEvents())
trainset = DataLoader(notedata, batch_size=10, shuffle=False)

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(3, 6)
        self.fc2 = nn.Linear(6, 6)
        self.fc3 = nn.Linear(6, 4)
        self.fc4 = nn.Linear(4, 1)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

net = Net()

import torch.optim as optim
optimizer = optim.Adam(net.parameters(), lr=0.0001)
EPOCHS = 3
for epoch in range(EPOCHS):
    for data in trainset:
        sample = data["sample"]
        label = data["label"].float()
        net.zero_grad()
        output = net(sample)

        loss = F.nll_loss(output, label.long())
        loss.backward()
        optimizer.step()
    print(loss)

seed = [46., 63., 53.]
for idx, i in enumerate(range(50)):
    output = net(torch.tensor([seed[idx],seed[idx+1],seed[idx+2]]))
    seed.append(interpolate(output[0].item(), 0, 1, 0, 127))
print(seed)

Hi,
I see you are using nll_loss. nll_loss expects your prediction to be a tensor containing the log-probabilities of each class, which is not what you are providing here. To get log-probabilities in your prediction you can add a LogSoftmax layer before returning x, or you can use CrossEntropyLoss instead (which applies LogSoftmax and NLLLoss for you).
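For example, a minimal standalone sketch of the two options for a hypothetical 128-class note-classification setup (not your exact model):

import torch
import torch.nn.functional as F

logits = torch.randn(10, 128)           # raw scores for a batch of 10, one score per note class
targets = torch.randint(0, 128, (10,))  # class indices, not scaled floats

# Option 1: log-probabilities + NLL loss
log_probs = F.log_softmax(logits, dim=1)
loss1 = F.nll_loss(log_probs, targets)

# Option 2: raw logits + cross entropy (does log_softmax + nll_loss internally)
loss2 = F.cross_entropy(logits, targets)

print(loss1, loss2)  # the two values match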

1 Like

To add to @nassim’s answer on why your model generates the same output over and over again: it is because the optimizer does not zero out the gradients and thus the parameter update does not take place. I would suggest that, instead of calling the zero_grad() method on your model, you call optimizer.zero_grad() and see if that changes anything.
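For reference, the usual ordering in a PyTorch training loop, shown as a standalone sketch with a dummy linear model and MSE loss purely for illustration (not your actual setup):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(3, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(10, 3)  # dummy batch of inputs
y = torch.rand(10, 1)  # dummy targets

for step in range(5):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    output = model(x)            # forward pass
    loss = criterion(output, y)  # compute the loss
    loss.backward()              # backpropagate fresh gradients
    optimizer.step()             # update the parameters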

1 Like

Hey, thanks for the answer! I tried putting log_softmax(x, dim=-1) inside the forward function. I’m training the network on 3 MIDI notes, with a label representing the next note after those 3.
Why is my output only zeroes? And now my loss is 0 at every epoch.
Am I formatting the data the wrong way?

sample: tensor([[47, 53, 74],
[53, 74, 69],
[74, 69, 67],
[69, 67, 71],
[67, 71, 54],
[71, 54, 72],
[54, 72, 66],
[72, 66, 56],
[66, 56, 62],
[56, 62, 58]])
label: tensor([0.5433, 0.5276, 0.5591, 0.4252, 0.5669, 0.5197, 0.4409, 0.4882, 0.4567,
0.4803])
output: tensor([[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.],
[0.]], grad_fn=)

I tried doing optimizer.zero_grad(), but I’m getting the same outputs as in my answer to @nassim.
My labels are scaled from the MIDI range 0-127 down to 0-1.

From what I understand, you have a continuous label and want the output to be continuous as well?

1 Like

I’m not quite sure what you mean, but to specify what kind of input and labels I’m training it on and what kind of output I want:

I’m training it on a tensor of three MIDI note values, for instance [60, 62, 68], with the next note in that sequence as the label (scaled from 0-127 to 0-1),
and I want the outputs to also be between 0-1 so I can scale them back to 0-127 to see which note it predicted.
I get the input tensor and the label tensor from a MIDI file by sliding a window of 4 values along it, so it can learn the patterns in the melody.
Does that answer your question?

Kind of, I was just trying to figure out which task you are trying to solve. NLL works for a classification task, which requires class-index (essentially one-hot) labels.
As far as I understand, you want the model to output a number (not a class index, for example), which makes this more of a regression problem. In that case, NLL is not a suitable loss function. Here, I would try changing three things:

  1. Remove the softmax layer
  2. Replace NLL with, for example, MSE loss
  3. Either normalise the input to 0-1 as well (because your labels are within that range, this should accelerate training) or unscale your labels back to 0-127
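A rough sketch of what the training step could then look like with those three changes, reusing your net, optimizer, trainset and F (not a tested drop-in fix):

# forward() now ends with "return self.fc4(x)" (no softmax on the output)
for data in trainset:
    sample = data["sample"].float() / 127.0      # inputs scaled to 0-1, like the labels
    label = data["label"].float()

    optimizer.zero_grad()
    output = net(sample)                         # shape (batch, 1)
    loss = F.mse_loss(output.squeeze(1), label)  # MSE instead of NLL; squeeze so the shapes match
    loss.backward()
    optimizer.step()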
1 Like

Yes, you are correct, I’m looking for it to output a number.
Okay, I tried scaling the inputs to 0-1, changing NLL to MSE, and removing the softmax layer. I’m finally getting reasonably scaled outputs, but it still gets stuck.

The latter part of the code now looks like this:

import torch.optim as optim
optimizer = optim.Adam(net.parameters(), lr=0.0001)
EPOCHS = 3
for epoch in range(EPOCHS):
    for data in trainset:
        sample = data["sample"]
        label = data["label"]
        #net.zero_grad()
        output = net(sample.float())
        print("sample: ", sample)
        print("label: ", label)
        print("output: ", output)
        loss = F.mse_loss(output, label)
        loss.backward()
        optimizer.zero_grad()
        optimizer.step()
    print(loss)


seed = [interpolate(60,0,127,0,1),
        interpolate(40,0,127,0,1),
        interpolate(42,0,127,0,1)
    ]

for idx, i in enumerate(range(50)):
    output = net(torch.tensor([seed[idx],seed[idx+1],seed[idx+2]]))
    seed.append(output.item())
converted = []
for i in seed:
    converted.append(interpolate(i,0,1,0,127))
print(converted)

a batch of 10 samples looks like this:

sample: tensor([[0.3701, 0.4173, 0.5827],
[0.4173, 0.5827, 0.5433],
[0.5827, 0.5433, 0.5276],
[0.5433, 0.5276, 0.5591],
[0.5276, 0.5591, 0.4252],
[0.5591, 0.4252, 0.5669],
[0.4252, 0.5669, 0.5197],
[0.5669, 0.5197, 0.4409],
[0.5197, 0.4409, 0.4882],
[0.4409, 0.4882, 0.4567]])
label: tensor([0.5433, 0.5276, 0.5591, 0.4252, 0.5669, 0.5197, 0.4409, 0.4882, 0.4567,
0.4803])
output: tensor([[0.4108],
[0.4104],
[0.4150],
[0.4138],
[0.4151],
[0.4155],
[0.4112],
[0.4163],
[0.4156],
[0.4131]], grad_fn=)

And the final output looks like this:

[60.0, 40.0, 42.0, 52.8820336163044, 52.084635734558105, 52.04773300886154, 52.462129801511765, 52.43161219358444, 52.42598783969879, 52.44178220629692, 52.44065809249878, 52.44028717279434, 52.44088897109032, 52.44084733724594, 52.440828412771225, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884, 52.440851122140884]

Looks good to me :)
I think I missed the sigmoid activation on your first layer (which already squashes its output into the 0-1 range), so I would replace that sigmoid with a ReLU.
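In your Net, the forward pass would then look roughly like this (a sketch that keeps your layer names and the F alias from your imports):

    def forward(self, x):
        x = F.relu(self.fc1(x))  # ReLU instead of sigmoid on the first layer
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.fc4(x)       # plain linear output for the regression target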

At this stage there is still no update, since you zero out the gradients right after accumulating them with loss.backward(). Normally you want to keep those gradients so that your optimizer ‘knows’ where to go; calling optimizer.zero_grad() between loss.backward() and optimizer.step() makes your model forget its ‘tracks’.
Make sure to call them in the right order: optimizer.zero_grad(), then loss.backward(), then optimizer.step().

1 Like

Like this?

import torch.optim as optim
optimizer = optim.Adam(net.parameters(), lr=0.001)
EPOCHS = 3
for epoch in range(EPOCHS):
    for data in trainset:
        sample = data["sample"]
        label = data["label"]
        #net.zero_grad()
        output = net(sample)
        print("sample: ", sample)
        print("label: ", label)
        print("output: ", output)
        loss = F.mse_loss(output, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(loss)

I tried running it and I’m still getting a stuck output. Could it have something to do with my data? I was thinking that instead of feeding it the actual note numbers, I could feed it the distances between the note numbers so it identifies interval patterns instead, but I don’t know if that’s part of the problem here. Should the output normally be more varied with data like this?
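By intervals I just mean something like this, a rough sketch using the notes list from my dataset code:

intervals = [notes[i + 1] - notes[i] for i in range(len(notes) - 1)]  # distances between consecutive notes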

The last batch and output look like this:

output: tensor([[0.5425],
[0.5451],
[0.5628],
[0.5420],
[0.5384]], grad_fn=)
sample: tensor([[0.4882, 0.4567, 0.4803],
[0.4567, 0.4803, 0.5906],
[0.4803, 0.5906, 0.6063],
[0.5906, 0.6063, 0.6299],
[0.6063, 0.6299, 0.6378]])
label: tensor([0.5906, 0.6063, 0.6299, 0.6378, 0.6535])
output: tensor([[0.5357],
[0.5299],
[0.5492],
[0.5672],
[0.5728]], grad_fn=)
tensor(0.0061, grad_fn=)
[45.0, 37.0, 57.0, 61.67518103122711, 63.66642940044403, 67.27606856822968, 68.12517189979553, 68.9210444688797, 69.55568808317184, 69.7746593952179, 69.97868794202805, 70.09793484210968, 70.15601027011871, 70.20184534788132, 70.22643959522247, 70.24091303348541, 70.25082188844681, 70.25624942779541, 70.25965583324432, 70.26180565357208, 70.26303195953369, 70.263811647892, 70.26428854465485, 70.26456105709076, 70.26474273204803, 70.26484870910645, 70.26490926742554, 70.26494711637497, 70.26496982574463, 70.2649849653244, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417, 70.26500010490417]

Did you remove the sigmoid activation after the first layer and replace it with a relu, and add a sigmoid after the last layer?

1 Like

I tried it, but it’s still stuck. My code now looks like this:

import os
from mido import MidiFile
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import random

class MidiFileCollection(object):
    def __init__(self, path):
        self.path = path
        self.files = []
        for filename in os.listdir(path):
            self.files.append(MidiFile(path+filename))

    def longestFile(self):
        longest = 0
        for index, file in enumerate(self.files):
            if index == 0:
                longest = file
            else:
                if file.getLength() > longest.getLength():
                    longest = file
        return longest
    def shortestFile(self):
        shortest = 1000000
        for index, file in enumerate(self.files):
            if index == 0:
                shortest = file
            else:
                if file.getLength() < shortest.getLength():
                    shortest = file
        return shortest
    def mostEvents(self):
        busiest = 0
        tempCount = 0
        for idx, file in enumerate(self.files):
            if idx == 0:
                busiest = file
            else:
                if len(file.notes()) > len(busiest.notes()):
                    busiest = file
        return busiest
    def get(self, index):
        return self.files[index]

path = "path"
collection = MidiFileCollection(path)

def generateDatapoint(notes):
    out = []
    for i in range(128):
        if i in notes:
            out.append(1)
        else:
            out.append(0)
    return out

def interpolate(value, leftMin, leftMax, rightMin, rightMax):
    # Figure out how 'wide' each range is
    leftSpan = leftMax - leftMin
    rightSpan = rightMax - rightMin

    # Convert the left range into a 0-1 range (float)
    valueScaled = float(value - leftMin) / float(leftSpan)

    # Convert the 0-1 range into a value in the right range.
    return rightMin + (valueScaled * rightSpan)

class NotesDataSet(Dataset):
    def __init__(self, midifile):
        self.midifile = midifile
        self.train = []
        self.labels = []
        notes = self.midifile.notes()

        for index, note in enumerate(range(len(notes)-3)):
            self.train.append([
            interpolate(notes[index],0,127,0,1),
            interpolate(notes[index+1],0,127,0,1),
            interpolate(notes[index+2],0,127,0,1)])
            self.labels.append(notes[index+3])

    def __len__(self):
        return len(self.train)

    def __getitem__(self, idx):
        current_sample = self.train[idx]
        current_label = self.labels[idx]
        return {
            "sample": torch.tensor(current_sample),
            "label": torch.tensor(interpolate(current_label,0,127,0,1))
        }

notedata = NotesDataSet(collection.mostEvents())
trainset = DataLoader(notedata, batch_size=5, shuffle=False)

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(3, 6)
        self.fc2 = nn.Linear(6, 6)
        self.fc3 = nn.Linear(6, 4)
        self.fc4 = nn.Linear(4, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return F.sigmoid(x)

net = Net()

import torch.optim as optim
optimizer = optim.Adam(net.parameters(), lr=0.00001)
EPOCHS = 3
for epoch in range(EPOCHS):
    for data in trainset:
        sample = data["sample"]
        label = data["label"]
        output = net(sample)
        print("sample: ", sample)
        print("label: ", label)
        print("output: ", output)
        optimizer.zero_grad()
        loss = F.mse_loss(output, label)
        loss.backward()
        optimizer.step()
    print(loss)


seed = [interpolate(random.randint(30,70),0,127,0,1),
        interpolate(random.randint(30,70),0,127,0,1),
        interpolate(random.randint(30,70),0,127,0,1)
    ]

for idx, i in enumerate(range(50)):
    output = net(torch.tensor([seed[idx],seed[idx+1],seed[idx+2]]))
    seed.append(output.item())
converted = []
for i in seed:
    converted.append(interpolate(i,0,1,0,127))
print(converted)

That’s strange; I’m trying to reproduce your result, but the model is learning instead.

import torch
import torch.optim as optim
import torch.nn.functional as F
import torch.nn as nn

class Net(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(3, 6)
        self.fc2 = nn.Linear(6, 6)
        self.fc3 = nn.Linear(6, 4)
        self.fc4 = nn.Linear(4, 1)

    def forward(self, x):
        # x = torch.sigmoid(self.fc1(x))
        x = F.relu(self.fc1(x))
        # x = self.fc1(x)
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = F.sigmoid(self.fc4(x))
        return x

net = Net()
optimizer = optim.Adam(net.parameters(), lr=0.0001)
EPOCHS = 100
samples = torch.FloatTensor(20, 3).random_(0, 127) / 127.
targets = torch.FloatTensor(20, 1).random_(0, 127) / 127.
# print(samples, targets)
# exit(0)
myloss = 1
for epoch in range(EPOCHS):
    for sample, label in zip(samples, targets):
        optimizer.zero_grad()
        output = net(sample)
        # print("sample: ", sample)
        # print("label: ", label)
        # print("output: ", output)

        loss = F.mse_loss(output, label)
        loss.backward()
        optimizer.step()
    print(loss.item())
#0.044560372829437256
0.04441039264202118
0.04426462948322296
0.04411971941590309
0.04397529736161232
0.04383138567209244
0.043687961995601654
0.04354503005743027
0.04340260848402977
0.04326070472598076
0.043119292706251144
0.04297839477658272
0.04283802583813667
0.04269819334149361
0.04255888611078262
0.042420100420713425
0.0422818586230278
0.04214416444301605
0.042006995528936386
0.04187038168311119
0.04173431918025017
0.04159880802035332
0.04146382957696915
0.041329436004161835
0.0411955751478672
0.04106230288743973
0.04092956334352493
0.040797386318445206
0.040665797889232635
0.04053475707769394
0.04040428623557091
0.04027441889047623
0.04014524072408676
0.040016673505306244
0.03988874331116676
0.03976137191057205
0.03963275998830795
0.03950601443648338
0.03937847539782524
0.03925103694200516
0.03912412375211716
0.038997795432806015
0.038873571902513504
0.03874838724732399
0.03862551227211952
0.03850169479846954
0.03837968036532402
0.03825680911540985
0.0381363108754158
0.038014840334653854
0.03789515420794487
0.03777654469013214
0.037656690925359726
0.037538737058639526
0.03741999715566635
0.03730364516377449
0.03718632832169533
1 Like

My loss is also getting smaller and smaller but the output is always the same. Could it be overfitting somehow? Or is that not what overfitting means?

If you were to generate output with your model, what outputs are you getting?

I have to go to bed now, but thanks for the replies so far!

If you’re getting the loss really close to zero (meaning your model fits the training data perfectly), then it is overfitting. I would try experimenting with different optimizers (RMSprop, Adadelta) and hyperparameters, and train on a small set until the model overfits.
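For instance, a rough sanity check along those lines, as a sketch that reuses your net, notedata and F (tiny_set, RMSprop and the learning rate are just example choices, not tuned values):

import torch.optim as optim

tiny_set = [notedata[i] for i in range(8)]            # a handful of samples from your dataset
optimizer = optim.RMSprop(net.parameters(), lr=0.01)  # try a different optimizer / learning rate

for epoch in range(500):
    for data in tiny_set:
        sample = data["sample"].unsqueeze(0)          # add a batch dimension -> shape (1, 3)
        label = data["label"].float().view(1, 1)
        optimizer.zero_grad()
        loss = F.mse_loss(net(sample), label)
        loss.backward()
        optimizer.step()
    # if the model can memorise these few samples, the loss should approach 0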