Two image input CNN Model

Hello everybody,

I'm new to CNNs. I have some videos, and each video shows one car. From each video I extract frame images of the car and also voice spectrum images of the car. Let's say I have 5 types of car. I have two separate main dataset folders: one contains the car pictures, the other contains the voice spectrum images of each car. Each main folder contains five subfolders, one per car type.

My CNN will take two input images: one is an image of the car, the other is the voice spectrum of the car. The images will be passed in parallel through VGG16 feature extraction layers, then flattened, combined, and classified. PS: I guess that in order to use the VGG16 pretrained weights I should use 3 channels. So I have two parallel VGG16 feature extraction branches, which will be combined at the classification part.

My question is… :smiley: How should I handle the dataloader and dataset? Should I have two dataloaders, one for each input image? How should I set up the dataset to pass it to my CNN model?

Have you seen a similar basic classification project on the forum or somewhere else? I searched many websites before posting this question, but could not understand clearly.

I'd really appreciate your answers.


RGB stands for Red, Green and Blue. Each has its own channel, and the channels are independent of one another. You could say an RGB image is 3 images stacked into one. The size is (3, H, W).

The point I'm getting at is that passing in n images as channels is a perfectly valid way to feed data into a model.


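import torch
import torch.nn as nn

conv2d = nn.Conv2d(in_channels=8, out_channels=32, kernel_size=(3, 3))

image1 = torch.rand(1, 3, 32, 32)
image2 = torch.rand(1, 5, 32, 32)
# Stack the two images along the channel dimension: 3 + 5 = 8 channels
output = conv2d(torch.cat([image1, image2], dim=1))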


Thank you so much for your quick answer! I understand your point. Sorry, I forgot to mention that the two images will be passed through the VGG16 pretrained network separately. As far as I know, in order to use the pretrained weights I should not change the VGG feature extraction layers, and the first layer takes 3 channels. So I guess I have to use the images separately, and I should create a model that takes two inputs in parallel. But how can I use a dataloader that points to the dataset folders? Should I have two dataset loaders?

Thank you so much,

If you're using a pretrained model, then you're likely retraining the final output layer. In fact, you can do the same with the first input layer. Here's a tutorial on how to do that:

At any rate, regarding the dataset, you can create a custom one via:

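import os

import pandas as pd
from torch.utils.data import Dataset
from torchvision.io import read_image


class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir1, img_dir2, transform=None, target_transform=None):
        # CSV with the image file name in column 0 and the label in column 1
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir1 = img_dir1
        self.img_dir2 = img_dir2
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path1 = os.path.join(self.img_dir1, self.img_labels.iloc[idx, 0])
        image1 = read_image(img_path1)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image1 = self.transform(image1)

        # Same steps for img_dir2 (assumes the paired image has the same file name in both folders)
        img_path2 = os.path.join(self.img_dir2, self.img_labels.iloc[idx, 0])
        image2 = read_image(img_path2)
        if self.transform:
            image2 = self.transform(image2)

        if self.target_transform:
            label = self.target_transform(label)
        return image1, image2, label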