I’m building a music genre recognition model using only a CNN, and I’m a bit stuck on how to make the CNN take only a part of my image at a time, for example a 96×96-pixel window.
I have generated mel-spectrogram images of the first 30 seconds of each audio clip in the GTZAN dataset.
Each image has dimensions [1, 96, 1366].
How should I build my model? Should I go over the whole image or take only a part of the image?
Also, posting a bit of code on feeding the image to the model would be very helpful.
One way would be to adapt the ImageFolder class to your dataset, and to add the following transformation, which takes a random crop from your image and resizes it to the size needed by the model you’re using.
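from torchvision import transforms
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(image_size),  # named RandomSizedCrop in older torchvision releases
    transforms.ToTensor(),  # Normalize works on tensors, not PIL images
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))])  # ImageNet stats, 3-channel input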
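If the spectrograms are already tensors of shape [1, 96, 1366], another option is to crop along the time axis yourself and feed the 96×96 windows straight to the network. Below is a minimal sketch of that idea, not a tuned architecture: `random_time_crop` and `SmallGenreCNN` are hypothetical names, the layer sizes are arbitrary, the random tensor stands in for a real spectrogram, and the 10 output classes assume GTZAN's ten genres.

```python
import torch
import torch.nn as nn


def random_time_crop(spec: torch.Tensor, width: int = 96) -> torch.Tensor:
    """Cut a random window of `width` frames from a [C, n_mels, T] spectrogram."""
    max_start = spec.shape[-1] - width
    start = torch.randint(0, max_start + 1, (1,)).item()
    return spec[..., start:start + width]


class SmallGenreCNN(nn.Module):
    """Tiny illustrative CNN; the layer sizes are placeholders."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # collapse the spatial dims to 1x1
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))


spec = torch.randn(1, 96, 1366)  # stand-in for one mel-spectrogram
batch = torch.stack([random_time_crop(spec) for _ in range(8)])  # [8, 1, 96, 96]
logits = SmallGenreCNN()(batch)  # [8, 10], one row of class scores per crop
```

In a real pipeline the crop would typically live in the dataset's `__getitem__`, so every epoch sees a different window of each clip.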
Hope this helps!
Thanks for the reply,
But what about the remaining image data? Only a part of the image might not be enough to analyse the genre of the music.
It all really depends on what you want to achieve. If you think that you need the entire image, you could just scale the image down using transforms.Resize(image_size) (called transforms.Scale in older torchvision releases) instead. But it might be the case that training on random crops helps. The best way would be to try both and see what each network learns.
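A middle ground between the two, if you train on crops but want the prediction to use the whole clip, is to cut the spectrogram into consecutive windows at test time and average the per-window predictions. This is only a sketch of that idea under some assumptions: `split_into_windows` is a hypothetical helper, the small `nn.Sequential` is an untrained stand-in for whatever network you trained, and the trailing frames that do not fill a full window are simply dropped.

```python
import torch
import torch.nn as nn


def split_into_windows(spec: torch.Tensor, width: int = 96) -> torch.Tensor:
    """Split a [C, n_mels, T] spectrogram into non-overlapping [C, n_mels, width] windows."""
    windows = spec.unfold(2, width, width)  # [C, n_mels, n_windows, width]
    return windows.permute(2, 0, 1, 3)      # [n_windows, C, n_mels, width]


# Untrained stand-in for a crop-level classifier (10 classes assumed for GTZAN).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)

spec = torch.randn(1, 96, 1366)     # stand-in for one mel-spectrogram
windows = split_into_windows(spec)  # 1366 // 96 = 14 windows of 96 frames

model.eval()
with torch.no_grad():
    # Average the per-window class probabilities into one clip-level prediction.
    probs = model(windows).softmax(dim=1).mean(dim=0)
predicted = int(probs.argmax())
```

Averaging probabilities is just one way to pool the windows; majority voting over the per-window argmax is another that may be worth comparing.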
If you’re loading images that large and you plan on just scaling them down, I would recommend preprocessing them all first (create another dataset where all the images are of the size used by the network) as it would greatly speed up training.
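To make the preprocessing suggestion concrete, here is a rough sketch that resizes everything once, caches the result on disk, and then trains from the cached tensors. The helper name `shrink_spectrograms`, the file name, the 96×96 target size, and the random tensors standing in for real data are all assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def shrink_spectrograms(specs: torch.Tensor, size=(96, 96)) -> torch.Tensor:
    """Resize a batch of [N, 1, H, W] spectrograms to `size` in a single pass."""
    return F.interpolate(specs, size=size, mode="bilinear", align_corners=False)


specs = torch.randn(4, 1, 96, 1366)  # stand-in for the full-size dataset
labels = torch.tensor([0, 1, 2, 3])  # stand-in genre labels

# One-off step: resize everything and cache the result on disk.
cache_path = os.path.join(tempfile.mkdtemp(), "gtzan_96x96.pt")
torch.save({"x": shrink_spectrograms(specs), "y": labels}, cache_path)

# Training then reads the small tensors and never touches the large images.
cached = torch.load(cache_path)
loader = DataLoader(TensorDataset(cached["x"], cached["y"]), batch_size=2, shuffle=True)
xb, yb = next(iter(loader))  # xb: [2, 1, 96, 96], yb: [2]
```

For a dataset too large to hold in memory, the same idea applies with one resized file per clip instead of a single cached tensor.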
Thanks for your input. I’ll try it and post my results.