Training CNN: Loss does not decrease

I am working on the DCASE 2016 challenge acoustic scene classification problem using a CNN. All training data (.wav audio files) are converted into 1024x1024 JPEG images of the MFCC output.

EPOCH = 10
LR = 0.001

   self.layer1 = nn.Sequential(                     # 1024x1024x3
       nn.Conv2d(3,96,13,3,3),                      # 340x340x96
       nn.MaxPool2d(2),                             # 170x170x96
       nn.Conv2d(96,256,10,2,1),                    # 82x82x256
       nn.MaxPool2d(2),                             # 41x41x256
       nn.Conv2d(256,384,5,2,1),                    # 20x20x384
       nn.Conv2d(384,384,3,1,1),                    # 20x20x384
       nn.Conv2d(384,256,3,1,1),                    # 20x20x256
       nn.MaxPool2d(2))                            # 10x10x256
   self.fc = nn.Linear(10*10*256, 15)

However, with this configuration the loss never decreases; it just fluctuates throughout the entire run, and the final accuracy is always stuck at 6%. Can anyone help guide me? Thank you.
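
The shape comments in the layer definition can be checked with the usual output-size formula for convolution and pooling, floor((size + 2*padding - kernel) / stride) + 1. Below is a minimal sketch in plain Python; the helper names `conv_out` and `trace` are my own, and the kernel/stride/padding values are copied from the posted `layer1`.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension for Conv2d or MaxPool2d."""
    return (size + 2 * padding - kernel) // stride + 1

# (description, kernel, stride, padding), in the order used by the posted layer1.
# nn.MaxPool2d(2) has kernel 2 and, by default, stride equal to the kernel size.
LAYERS = [
    ("conv 3->96", 13, 3, 3),
    ("maxpool", 2, 2, 0),
    ("conv 96->256", 10, 2, 1),
    ("maxpool", 2, 2, 0),
    ("conv 256->384", 5, 2, 1),
    ("conv 384->384", 3, 1, 1),
    ("conv 384->256", 3, 1, 1),
    ("maxpool", 2, 2, 0),
]

def trace(size):
    """Return the spatial size after each layer for a square input."""
    sizes = []
    for _, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

sizes_1024 = trace(1024)
print(sizes_1024)
```

For a 1024x1024 input this reproduces the sizes in the comments and ends at 10x10, which matches the `10*10*256` input of `self.fc`. If the input resolution changes, the last value changes too, and the `nn.Linear` input size has to be updated to match.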

Can you show me your forward function?

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR)
for epoch in range(EPOCH):
    for i, (img_name,img_label) in enumerate(train_loader):
        images = Variable(img_name).cuda()
        labels = Variable(img_label).cuda()
        outputs = cnn(images)
        loss = criterion(outputs, labels)

        print("===> Epoch[{}/{}]({}/{}): Loss: {:.4f}".format(epoch+1,EPOCH, i, len(train_loader), loss.data[0]))
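
One thing worth checking: the loop as pasted computes the loss but shows no `optimizer.zero_grad()`, `loss.backward()`, or `optimizer.step()`. If those calls are missing in the real script too, the weights are never updated and the loss cannot go down. For comparison, here is a minimal sketch of a complete training step. The tiny model and the random tensors are hypothetical stand-ins (not the posted network or data), and `.cuda()` is left out so the sketch runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a small linear classifier with 15 outputs and one
# fixed batch of random "images" shaped (batch, channels, height, width).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 15))
images = torch.randn(32, 3, 8, 8)
labels = torch.randint(0, 15, (32,))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for step in range(50):
    optimizer.zero_grad()              # clear gradients from the previous step
    outputs = model(images)            # forward pass
    loss = criterion(outputs, labels)  # scalar loss for this batch
    loss.backward()                    # compute gradients
    optimizer.step()                   # update the weights
    losses.append(loss.item())

print("first loss: {:.4f}, last loss: {:.4f}".format(losses[0], losses[-1]))
```

Repeating the same batch like this also works as a quick sanity check: a model that cannot lower its loss on a single fixed batch usually has a problem in the training step or the input, rather than needing more epochs.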

The forward function is something like this:

def forward(self, inp):
       x = self.embedding(inp) 

It is defined in your module.

def forward(self, x):
    out = self.layer1(x)
    out = out.view(out.size(0), -1)
    out = self.fc(out)
    return out

I am sorry, I pasted the wrong info.

Sorry for taking so long. I ran into the same issue while training an RNN recently. I will think about it tomorrow. Sorry again!

Hi, from my general experience, 10 epochs may or may not be a good indication of learnability, especially when the dataset is fairly large and the labels are sparse. I have not looked into the DCASE data, but I assume each audio file is converted into multiple frames, making the overall dataset fairly large. I would recommend training for at least 50 epochs and experimenting. Lowering the learning rate might also help.
I don’t see any obvious issue with the code structure (I am not commenting on the model architecture), and I am able to train models with a similar format.

I tried to find the reason from two angles: either the network itself isn’t working, or the loss is not being calculated correctly. I think the former is more likely, but I can’t find any issue. Maybe the problem is in the input? I’m sorry I couldn’t help.
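
A reference point that may help separate the two cases: with 15 output classes (the size of the posted `nn.Linear(10*10*256, 15)`), a model that effectively guesses uniformly has a cross-entropy of ln(15) and an expected accuracy of 1/15. A small sketch of that arithmetic, with `NUM_CLASSES` as my own name for the constant:

```python
import math

NUM_CLASSES = 15  # output size of the posted fc layer

# Uniform guessing is correct 1 time in NUM_CLASSES on average.
chance_accuracy = 1.0 / NUM_CLASSES

# nn.CrossEntropyLoss uses the natural log, so a uniform prediction
# costs -ln(1/NUM_CLASSES) = ln(NUM_CLASSES) per sample.
chance_loss = math.log(NUM_CLASSES)

print("chance accuracy: {:.1%}".format(chance_accuracy))
print("chance-level cross-entropy: {:.4f}".format(chance_loss))
```

The reported 6% accuracy is close to 1/15, so the network appears to be at chance level. If the printed loss hovers around 2.7, the loss calculation is probably fine and the model is simply not learning; a value far from that would point more toward the loss or the labels.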

How are you loading the data? Can you share the dataset/pre-processing code? It is quite possible that the network is not learning anything (too few parameters, or too shallow, maybe?).
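
On the parameter question, a rough count for the posted layers is easy to do by hand: a `Conv2d` has out_channels * in_channels * kernel * kernel weights plus one bias per output channel, and a `Linear` has in_features * out_features weights plus one bias per output. A sketch, with the layer sizes copied from the posted `layer1` and `fc`; the helper names are my own:

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square-kernel Conv2d with bias enabled."""
    return out_ch * in_ch * kernel * kernel + out_ch

def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer with bias enabled."""
    return in_features * out_features + out_features

# (in_channels, out_channels, kernel_size) for each Conv2d in the posted layer1.
CONV_LAYERS = [
    (3, 96, 13),
    (96, 256, 10),
    (256, 384, 5),
    (384, 384, 3),
    (384, 256, 3),
]

conv_total = sum(conv2d_params(i, o, k) for i, o, k in CONV_LAYERS)
fc_total = linear_params(10 * 10 * 256, 15)
total = conv_total + fc_total

print("conv: {:,}  fc: {:,}  total: {:,}".format(conv_total, fc_total, total))
```

That comes to roughly 7.6 million parameters, so capacity is probably not the limiting factor here. Separately, the posted `layer1` has no activation (for example `nn.ReLU`) between the convolutions, which may be worth a look.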

Sorry for the late reply to this post; I had exams in January and had to move my focus there. Anyway, I am now continuing to work on my model.

Here is a picture of an audio file after MFCC extraction using the librosa library. It is currently resized to 512x512 (200 DPI) instead of the previous, ambitious 1024x1024 (200 DPI) setting. These pictures are all in JPG format and reside in a folder. I then have all the image names and image labels stored in CSV files.

X_train = loadDataSet(os.path.join(eval_path,"fold1_train.csv"),train_path,ToTensor())
train_loader = DataLoader(X_train, shuffle=True)

X_test = loadDataSet(os.path.join(eval_path,"fold1_evaluate.csv"),test_path,ToTensor())
test_loader = DataLoader(X_test, shuffle=True)
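
Two things may be worth checking on the data side. First, `DataLoader` is called without `batch_size`, so it uses the default of 1; single-sample batches can make the loss look very noisy. Second, `nn.CrossEntropyLoss` expects integer class indices in the range 0 to 14 for 15 classes, so it helps to confirm what the CSV actually contains. Below is a sketch of such a check using only the standard library. The `filename,label` row layout is an assumption (the real CSV format is not shown in this thread), and the function names are my own.

```python
import csv
import io
from collections import Counter

NUM_CLASSES = 15

def label_counts(csv_file):
    """Count rows per label, assuming rows of the form: filename,label."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        counts[int(row[1])] += 1
    return counts

def invalid_labels(counts, num_classes=NUM_CLASSES):
    """Return labels outside 0..num_classes-1, sorted."""
    return sorted(label for label in counts if not 0 <= label < num_classes)

# Tiny in-memory stand-in for a file such as fold1_train.csv (made-up rows).
sample = io.StringIO("a001.jpg,0\na002.jpg,14\na003.jpg,15\na004.jpg,0\n")
counts = label_counts(sample)
bad = invalid_labels(counts)

print("rows per label:", dict(counts))
print("labels outside the valid range:", bad)
```

With a real file, `open(path, newline="")` would replace the `io.StringIO` object. A very uneven count per label, or any label outside the valid range, would be a lead worth following before changing the model.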

Sorry for the late reply, and thanks anyway. I am convinced that it might not be working after all: after a lot of debugging, the loss still ends up around the same value.