How to pass pairwise input images after combining the models

I have pre-trained two VGG models on two different but related types of images, and then combined the two models:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class MyEnsemble(nn.Module):
    def __init__(self, modelA, modelB, nb_classes=2):
        super(MyEnsemble, self).__init__()
        self.modelA = modelA
        self.modelB = modelB
        # Remove last linear layer
        self.modelA.classifier[6] = nn.Identity()
        self.modelB.classifier[6] = nn.Identity()
        # Create new classifier
        self.classifier = nn.Linear(4096+4096, nb_classes)
    def forward(self, x1, x2):
        x1 = self.modelA(x1)
        x2 = self.modelB(x2)
        x = torch.cat((x1, x2), dim=1)  # concatenate the two 4096-dim feature vectors
        x = self.classifier(F.relu(x))
        return x

# Train your separate models
# ...
# We use pretrained torchvision models here
modelA = models.vgg16(pretrained=True)
num_ftrs = modelA.classifier[6].in_features
modelA.classifier[6] = nn.Linear(num_ftrs,2)
modelB = models.vgg16(pretrained=True)
num_ftrs = modelB.classifier[6].in_features
modelB.classifier[6] = nn.Linear(num_ftrs,2)
model = MyEnsemble(modelA, modelB)  

Now I want to test the combined model using test images. Can anyone please help me with how to pass a pair of input images, i.e. how to test the combined model using two different but related types of images?

You can directly pass the two input tensors to the model and would get the output as in a standard use case:

x1 = torch.randn(1, 3, 224, 224)
x2 = torch.randn(1, 3, 224, 224)
out = model(x1, x2)

I’m not sure where you are currently stuck, so feel free to explain the trouble a bit more. 🙂

Thanks, @ptrblck, for your reply. Maybe my question is very silly, but I am still very confused. If I have one type of image, I can test the model for classification using the following code:

def test(model, criterion):
    running_loss = 0.0
    running_corrects = 0
    pred = []
    true = []
    output = []
    pred_wrong = []
    true_wrong = []
    image = []
    for j, (inputs, labels) in enumerate(test_loader):
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        outputs = sm(outputs)
        _, preds = torch.max(outputs, 1)
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
        preds = preds.cpu().numpy()
        labels = labels.cpu().numpy()
        preds = np.reshape(preds, (len(preds), 1))
        labels = np.reshape(labels, (len(preds), 1))
        inputs = inputs.cpu().numpy()
        for i in range(len(preds)):
            # Collect predictions and targets for the confusion matrix
            pred.append(preds[i][0])
            true.append(labels[i][0])
    mat_confusion = confusion_matrix(true, pred)
    print('Confusion Matrix:\n', mat_confusion)
feature_extract = True
sm = nn.Softmax(dim=1)
test_transforms = transforms.Compose([transforms.Resize((224, 224)),
                                      transforms.ToTensor(),  # Normalize expects a tensor
                                      transforms.Normalize([0.485, 0.456, 0.406],
                                                           [0.229, 0.224, 0.225])])
test_data = datasets.ImageFolder(test_dir, transform=test_transforms)
num_workers = 0
print("Number of Samples in Test ", len(test_data))
test_loader = torch.utils.data.DataLoader(test_data, batch_size,
                                          num_workers=num_workers, shuffle=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()
test(model, criterion)

But in this case, we have two types of input. How do we use these two types of input images in the test function?

The accuracy computation wouldn’t need to be changed, since your model would still output a single prediction for each input pair. However, you would want to change the data loading pipeline to get both images. For this, you could write a custom Dataset and return both images as well as the label in __getitem__. This will make sure that the DataLoader loop yields a batch of both image tensors and the target tensor.
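As a rough sketch of this idea, the following custom Dataset returns two images and a label per index. The class name PairedTensorDataset and the random tensors standing in for real images are assumptions made for illustration; with real data you would load and transform the image files inside __getitem__ instead.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PairedTensorDataset(Dataset):
    """Hypothetical dataset that yields (image1, image2, label) per index."""

    def __init__(self, images1, images2, labels):
        # All three containers must describe the same samples in the same order.
        assert len(images1) == len(images2) == len(labels)
        self.images1 = images1
        self.images2 = images2
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.images1[idx], self.images2[idx], self.labels[idx]


# Random tensors stand in for real, already transformed images (assumption).
paired_data = PairedTensorDataset(
    torch.randn(8, 3, 224, 224),
    torch.randn(8, 3, 224, 224),
    torch.randint(0, 2, (8,)),
)
paired_loader = DataLoader(paired_data, batch_size=4, shuffle=False)

for input1, input2, labels in paired_loader:
    # Each batch holds both image tensors plus the targets.
    print(input1.shape, input2.shape, labels.shape)
```

Each iteration of the loop then provides exactly the three objects that a call such as model(input1, input2) and the loss computation need.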

Thanks for your kind suggestion. I have written a custom Dataset and called the above test code, but I am getting an error. Can you please tell me whether I am going in the right direction or whether something is wrong?

import os
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class bothDataset(Dataset):
    def __init__(self, csv_path, root_dir1, root_dir2, transform=None):
        self.img_names = pd.read_csv(csv_path)
        self.transform = transform
        self.root_dir1 = root_dir1
        self.root_dir2 = root_dir2

    def __len__(self):
        return len(self.img_names)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        res = transforms.Resize((224, 224))
        # First image of the pair (file name from CSV column 0)
        img_name1 = os.path.join(self.root_dir1,
                                 self.img_names.iloc[idx, 0])
        image1 = Image.open(img_name1 + '.bmp')
        image1 = res(image1)
        # Second image of the pair (file name from CSV column 1)
        img_name2 = os.path.join(self.root_dir2,
                                 self.img_names.iloc[idx, 1])
        image2 = Image.open(img_name2 + '.bmp')
        image2 = res(image2)
        if self.transform is not None:
            image1 = self.transform(image1)
            image2 = self.transform(image2)
        label = self.img_names.iloc[idx, 2]
        return image1, image2, label

test_loader = bothDataset(csv_path='RelationBS1.csv',
                          root_dir1=test_dir1, root_dir2=test_dir2)
print('Num test images: ', len(test_loader))

I changed the loop in the test function to:

for j, (input1, input2, labels) in enumerate(test_loader):
    input1 = input1.to(device)
    input2 = input2.to(device)
    labels = labels.to(device)
    outputs = model(input1, input2)

The Dataset code looks alright, but you would need to change the DataLoader loop into:

for j, (input1, input2, labels) in enumerate(test_loader):

Thanks. Now I am getting two problems.
If I use

labels = labels.to(device)

I get:

AttributeError: 'numpy.int64' object has no attribute 'to'

If I do not use CUDA, the above problem goes away, but I get this error:

RuntimeError: Expected 4-dimensional input for 4-dimensional weight 64 3 3 3, but got 3-dimensional input of size [1, 224, 224] instead

The second error is raised, if your input tensors do not have the expected 3 channels, but are apparently missing the channel dimension (and might thus be grayscale images originally).
If that’s the case, add the channel dimension via unsqueeze and change the first conv layer to accept a single input channel by replacing it in the pretrained models.

Thanks @ptrblck for helping me. But I am not sure whether I am getting it correctly or not. Do you mean

input1= input1.unsqueeze(0)
input2= input2.unsqueeze(0)


modelA.features[0]= nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1)

Can you please explain to me a little bit?

It depends where you are using these methods.
You should check the shape of the image tensors inside the __getitem__ method and if both have only two dimensions, then use unsqueeze(0). Alternatively, you could also check the shape of the batch returned by the DataLoader and call unsqueeze(1), if needed, but I would prefer to use the former approach (inside __getitem__).
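To make the two options concrete, here is a small sketch using random tensors as stand-ins for grayscale images (the shapes are assumptions chosen for illustration): unsqueeze(0) adds the channel dimension to a single image, while unsqueeze(1) adds it to an already batched tensor.

```python
import torch

# A single grayscale image without a channel dimension: [height, width].
image = torch.randn(224, 224)
if image.dim() == 2:
    image = image.unsqueeze(0)  # -> [1, 224, 224], as would be done inside __getitem__

# A batch that is missing the channel dimension: [batch, height, width].
batch = torch.randn(4, 224, 224)
if batch.dim() == 3:
    batch = batch.unsqueeze(1)  # -> [4, 1, 224, 224]

print(image.shape, batch.shape)
```

Doing this inside __getitem__ keeps every sample in the [channels, height, width] layout, so the DataLoader can stack the samples into a 4-dimensional batch without further changes in the loop.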

Depending on the models you are using, the replacement of the conv layer might work. However, you need to check, if the first conv layer is indeed accessible via model.features[0] or model.conv1 etc. To check it, have a look at the source code of the model or use print(model) to see the layer names.

@ptrblck Thank You so much for your help.