I am working on the DCASE 2016 challenge acoustic scene classification problem using a CNN. All training data (.wav audio files) are converted into 1024x1024 JPEG images of their MFCC output.
However, with this configuration the loss never decreases; it just fluctuates throughout the entire run, and the final accuracy always gets stuck at around 6%. Can anyone help guide me? Thank you.
Hi, in my general experience, 10 epochs may not be enough to judge learnability, especially when the dataset is fairly large and the labels are sparse. I have not looked into the DCASE data, but assuming each audio file is converted into multiple frames, the overall dataset size should be fairly large. I would recommend training for at least 50 epochs and experimenting; lowering the learning rate might also help.
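To illustrate why lowering the learning rate can turn a fluctuating loss into a decreasing one, here is a toy gradient-descent sketch (pure NumPy, on a hypothetical quadratic loss, not your actual model):

```python
import numpy as np

def descend(lr, steps=50):
    """Minimize f(w) = w^2 with plain gradient descent at learning rate lr."""
    w = 5.0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w      # df/dw for f(w) = w^2
        w = w - lr * grad
        losses.append(w ** 2)
    return losses

# Too-high learning rate: each update overshoots the minimum,
# so the loss oscillates and grows instead of shrinking.
high = descend(lr=1.1)

# Smaller learning rate: updates stay inside the basin
# and the loss decreases steadily.
low = descend(lr=0.1)
```

The same effect shows up in real networks: a loss that bounces around without trending down is one of the classic signatures of a learning rate that is too high.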
I don't see any obvious issue with the code structure (I am not commenting on the model architecture), and I am able to train models with a similar format.
I tried to find the reason from two angles: on the one hand, the network isn't working; on the other hand, the loss is not being calculated correctly. I think the former is more likely, but I can't find any issue. Maybe the problem is in the input? I'm sorry I couldn't be more help.
How are you loading the data? Can you share the dataset/pre-processing code? It is very possible that the network is not learning anything at all (too few parameters, too shallow, etc.). Note that 6% accuracy is roughly chance level for the 15 DCASE 2016 scene classes (1/15 ≈ 6.7%), which also points at the network learning nothing.
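As a reference point for the data-loading question, here is a minimal stdlib sketch of reading (filename, label) pairs from a CSV; the column layout is a guess, so adjust it to match the actual files:

```python
import csv

def load_labels(csv_path):
    """Read (image filename, scene label) pairs from a CSV file.

    Assumes each row looks like: filename,label (no header row) --
    adjust the indexing if your CSV has a header or extra columns.
    """
    pairs = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                pairs.append((row[0], row[1]))
    return pairs

# Sanity checks worth running on your own data:
# - every filename listed in the CSV actually exists on disk
# - the set of labels matches the expected scene classes
# - the label distribution is not accidentally all one class
```

A label/filename mismatch at this stage (shuffled rows, stale filenames, wrong column) would produce exactly the symptom described: chance-level accuracy no matter how long you train.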
Sorry for the late reply to this post; I have exams in January and had to move my focus there. Anyway, I am now continuing work on my model.
Here is a picture of an audio file after MFCC extraction using the librosa library. It is currently resized to 512x512 (200 DPI) instead of the previous, more ambitious 1024x1024 (200 DPI) setting. These pictures are all JPEGs residing in a folder; I then store each image name and its label in CSV files.
Sorry for the late reply, and thanks anyway. I am now convinced that it simply isn't working; after a lot of debugging the loss still ends up around the same value.