Two-image-input VGG16 model

Hello everybody,
I have a video dataset. From each video I extract one image frame and one audio spectrogram (saved as an image). I have two main folders: one contains the video image frames and the other contains the audio spectrograms of each video. Each main folder has 8 subfolders, which are the classes.

My model has two inputs: one image frame and one audio spectrogram image. Each input is passed through a pretrained VGG16 in parallel for feature extraction. The results of the two branches are then concatenated into an 8192-dimensional vector, which is passed to the classification step.
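To illustrate the concatenation step in isolation (a sketch with dummy feature batches; 4096 is the output size of VGG16's penultimate classifier layer, so two branches give 8192):

```python
import torch

# Two dummy 4096-dim feature batches, as VGG16's classifier (with its
# final layer replaced by Identity) would produce for a batch of 2 images.
frame_feats = torch.randn(2, 4096)
audio_feats = torch.randn(2, 4096)

# Concatenate along the feature dimension (dim=1), not the batch dimension.
combined = torch.cat((frame_feats, audio_feats), dim=1)
print(combined.shape)  # torch.Size([2, 8192])
```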

So first, I create a dataset that combines these two datasets (one of image frames, the other of audio spectrogram images), so that the image frame and audio spectrogram of the same video are matched for the dataloader.

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
             transforms.Resize((224, 224)),
             transforms.ToTensor(),  # ImageFolder yields PIL images; convert to tensors
])

class ConcatDataset(Dataset):
    def __init__(self, firstDataset, secondDataset):
        self.firstDataset = firstDataset  #video image frames of the videos
        self.secondDataset = secondDataset  #audio image spectrum of videos

    def __getitem__(self, i):
        f_x, f_y = self.firstDataset[i] #f_x: video frame image, f_y: class of the f_x
        s_x, s_y = self.secondDataset[i]  #s_x: audio spectrum image , s_y: class of the s_x
        #I checked that f_x and s_x come from the same video
        return {"videoFrame": f_x, "audioImage":s_x}, f_y

    def __len__(self):
        return len(self.firstDataset)
train_loader = DataLoader(ConcatDataset(datasets.ImageFolder(root=OUTPUT_DIR_OF_VIDEO_IMAGES, transform=transform),
                 datasets.ImageFolder(root=OUTPUT_DIR_OF_SOUND_SPECTOGRAMS, transform=transform)),
                 batch_size=batch_size, shuffle=True)
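`ImageFolder` sorts files alphabetically within each class folder, so index `i` only lines up across the two datasets if the directory trees mirror each other exactly. A small sanity check could catch silent misalignment (a sketch; `check_alignment` is a hypothetical helper that operates on the `samples` lists of `(path, label)` tuples that `ImageFolder` exposes):

```python
def check_alignment(first_samples, second_samples):
    """Raise if two (path, label) sample lists are misaligned."""
    if len(first_samples) != len(second_samples):
        raise ValueError("dataset sizes differ")
    for (f_path, f_y), (s_path, s_y) in zip(first_samples, second_samples):
        if f_y != s_y:
            raise ValueError(f"label mismatch: {f_path} vs {s_path}")

# With the two ImageFolder datasets, e.g.:
# check_alignment(firstDataset.samples, secondDataset.samples)
```

If the filenames follow a common naming scheme, comparing basenames inside the loop would make the check stricter.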

My model is:

import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

vggmodel = vgg16(weights=VGG16_Weights.DEFAULT)  # replaces the deprecated pretrained=True
for param in vggmodel.features.parameters():
    param.requires_grad = False  # note: "require_grad" (without the s) is a silent no-op
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()  # must be called before assigning submodules
        m = vggmodel
        for param in m.parameters():
            param.requires_grad = False

        m.classifier[6] = nn.Identity() # replaced final FC layer with identity

        self.vgg16_modified = m
        self.classifier_last = nn.Sequential(
            nn.Linear(8192, 256),
            nn.Linear(256, 8),
        )

    def forward(self, x):
        y1 = self.vgg16_modified(x["videoFrame"])
        y2 = self.vgg16_modified(x["audioImage"])
        y = torch.concat((y1, y2), 1) #IS IT TRUE ALSO?
        return self.classifier_last(y)  

model = MyModel()

This is my model :smiley:
When I trained it, I got low accuracy. I just wondered whether my logic is correct. If it is, maybe my dataset images are not suitable for this case.

First epoch accuracy: 20.000000298023224
Second epoch: 0.0
Third epoch: 20.000000298023224
Fourth epoch: 60.00000238418579
Fifth epoch: 20.000000298023224

Thanks for everything,
Best regards.

Hi Director!

The short story is that concatenating feature vectors from your image
frame and audio spectrum and then training a final classifier on the
combined features is a reasonable approach. But I have concerns
about using the same (pretrained) vgg16 on the audio spectrum as
on the image frame.

vgg16 is pretrained (VGG16_Weights.DEFAULT) on “normal” images
(perhaps a picture of some cows or a bicycle or a bottle of ketchup).
You haven’t told us what your use case is or what your video dataset
looks like, but if your “image frame” is a “normal” image, then using
(or starting with) pretrained vgg16 could make sense.

However, a spectrogram (“audio spectrum image”) is very different in
character from a normal image. I could believe that the pretrained vgg16
weights might not be useful in extracting meaningful features from your
spectrogram. If you cannot afford to do any training / fine-tuning of the
vgg16 part of your model, I suspect that your combined model won’t work
very well and that you might be better off using just your image frame as
input, and leaving the spectrogram out entirely.

If you can afford to train / fine-tune the vgg16 part of your model, I would
strongly suggest having two copies of vgg16 in your model – one for the
image frame and a second, with its own weights, for the spectrogram. If
you do this, then given my belief that a spectrogram is very different from a
normal image, you might consider initializing the weights of the vgg16 that
you use for the spectrogram randomly, rather than using pretrained weights.

Either way, I would recommend first training (the normal-image part of) your
final classifier and then – if you can afford the training cost – training and
fine-tuning the normal-image and spectrogram vgg16s, while continuing to
train the final classifier.
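That two-stage recipe could be sketched like this (the attribute names `frame_vgg`, `audio_vgg`, `classifier_last` and all learning rates are assumptions, matching a two-backbone model as described; Adam is just one reasonable optimizer choice):

```python
import torch

def freeze_backbones(model, lr=1e-3):
    """Stage 1: freeze both backbones, train only the final classifier."""
    for p in model.frame_vgg.parameters():
        p.requires_grad = False
    for p in model.audio_vgg.parameters():
        p.requires_grad = False
    return torch.optim.Adam(model.classifier_last.parameters(), lr=lr)

def unfreeze_for_finetuning(model):
    """Stage 2: fine-tune everything, with smaller lrs for the backbones."""
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.Adam([
        {"params": model.frame_vgg.parameters(), "lr": 1e-5},
        {"params": model.audio_vgg.parameters(), "lr": 1e-5},
        {"params": model.classifier_last.parameters(), "lr": 1e-4},
    ])
```

You would train for some epochs with the optimizer from `freeze_backbones`, then switch to the one from `unfreeze_for_finetuning` and continue training.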

These results suggest that you only have five samples in your training
set. (From your quoted accuracies it looks like you get none right, one
out of five right, or three out of five right.) If so, you would need much
more training data to perform any useful training.

Also, it looks like you have only trained for five epochs – also much less
than you would need for any effective training or fine-tuning.

Good luck!

K. Frank


Thank you so much!
It is very informative.