New to PyTorch, having trouble making predictions once data is loaded using DataLoader

I’m completely new to PyTorch (I have previously used TensorFlow) and I’m stuck on something I’m working on. I’ve been tasked with using a pretrained model to extract features from application documents and then compute similarity scores to identify duplicates. I have converted all of the PDFs to .jpg files, loaded the pretrained model, and modified the last layer to extract features.

The folder structure is like this:
root
    Application 1
        image 1
        image 2…
    Application 2
        image 1
        image 2…

What I’m trying to do is extract features from the images in every sub-directory, calculate the Euclidean distance between them, and output a similarity matrix. Where I’m having an issue, and this may seem really basic, is actually making the predictions once the data is loaded. Below is the code I have so far; any help would be greatly appreciated.

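# imports assumed from the names used in the code below
import numpy as np
import pandas as pd
import torch
from torch import nn
from sklearn.metrics.pairwise import euclidean_distances
from torchvision import datasets, models, transforms

def get_pretrained_model_notop(model_name):  # pull the model and change the last layer
    pretrained_model = model_name(pretrained=True)  # downloads pretrained model weights
    for param in pretrained_model.parameters():
        param.requires_grad = False  # freezes layers
    # drops the final layer, because we aren't classifying 1000 ImageNet classes
    pretrained_model = nn.Sequential(*list(pretrained_model.children())[:-1])
    pretrained_model.fc = nn.Sequential(
        nn.Flatten()  # adds flatten layer at end of model
    )
    pretrained_model.eval()  # inference mode: BatchNorm uses its stored statistics
    if torch.cuda.is_available():  # uses GPU if available
        pretrained_model = pretrained_model.cuda()
    return pretrained_model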

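def get_similarity(pretrained_model, train_imgs):  # extract features from the model and compute similarity scores
    device = next(pretrained_model.parameters()).device  # same device as the model
    with torch.no_grad():
        bottleneck_feature_example = pretrained_model(train_imgs.to(device))
    # scikit-learn expects a CPU array, so move the features off the GPU first
    similarity = euclidean_distances(bottleneck_feature_example.cpu().numpy())
    similarity = similarity / similarity.max()
    similarity_df = pd.DataFrame(similarity)
    similarity_df = 1 - similarity_df
    return np.round(similarity_df, 4)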

transforms = transforms.Compose([transforms.Resize(256),
                                 transforms.CenterCrop(224),
                                 transforms.ToTensor(),
                                 transforms.Normalize([0.485, 0.456, 0.406],
                                                      [0.229, 0.224, 0.225])])

img_dir = 'path'
images = datasets.ImageFolder(img_dir, transform=transforms)
data_loader = torch.utils.data.DataLoader(images,
                                          batch_size=32,
                                          shuffle=True,
                                          num_workers=4)

model_list = [models.densenet201]
model_name = ['densenet201']
pretrained_model = [get_pretrained_model_notop(selected_model) for selected_model in model_list]
for data in data_loader:
    pred = [get_similarity(pretrained, data) for pretrained in pretrained_model]
    pred_label_ensemble = sum(pred) / len(pred)
    pred_label_ensemble.columns = page_numbers
    prob_output_folder = unzipped.replace('MF_loan_document', 'MF_loan_document_results')
    pred_label_ensemble.to_csv(prob_output_folder + '/' + 'results.csv', index=False)

I’m not sure I understand the use case completely and where you are stuck at the moment.
The posted code will iterate the DataLoader and create the similarities for the current batch for all pretrained models.
Would you like to calculate the similarities differently?

PS: You can post code snippets by wrapping them in three backticks ``` :wink:

The predictions themselves weren’t running. I realized it’s because I needed to subset to data[0], since these images aren’t labeled. One follow-up question I have is regarding the batches. The application I’m attempting to compute similarity on has 140 pages, so it’s been converted to 140 images. When I set batch_size=10, I get a 10x10 similarity matrix. But when I set batch_size=1, it outputs nothing. Any idea why that might be? Ultimately, my goal is to compute pairwise similarity scores between every image.
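
For reference, a minimal sketch of the data[0] fix described above. `ImageFolder` yields (image, class_index) pairs, with the index derived from the sub-directory name, so every batch from the DataLoader is an (images, labels) pair even when no labels were supplied explicitly. The `TensorDataset` of random tensors below is only a stand-in assumption so the snippet runs without image files.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for ImageFolder: 6 random "images" plus the integer folder index
# that ImageFolder would attach to each sample.
fake_images = torch.randn(6, 3, 8, 8)
fake_labels = torch.tensor([0, 0, 0, 1, 1, 1])
dataset = TensorDataset(fake_images, fake_labels)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batch_shapes = []
for images, _labels in loader:  # same as images = data[0]
    # only the image tensor should be passed to the model
    batch_shapes.append(tuple(images.shape))

print(batch_shapes)
```

Unpacking in the loop header keeps the folder indices available in case they are needed later to group pages by application.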

You would have to create a 140x140 similarity matrix covering all 140 images.
Based on the current similarity matrix, it seems you are calculating it only on the current batch.
For 10 images, you would thus get a 10x10 matrix, and for a single image you should get a single similarity score (the image compared to itself).
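
One plausible explanation for the empty output at batch size 1, assuming `get_similarity` as posted: the distance matrix for a single image is `[[0.]]`, so dividing by its maximum is 0/0 and gives NaN, which pandas writes to CSV as an empty field by default. The sketch below mirrors only the normalisation step; `normalized_similarity` and `safe_similarity` are hypothetical helper names, not part of the original code.

```python
import numpy as np
import pandas as pd

def normalized_similarity(distances):
    """Mirror of the normalisation in get_similarity: 1 - d / max(d)."""
    with np.errstate(invalid="ignore", divide="ignore"):
        scaled = distances / distances.max()
    return 1 - pd.DataFrame(scaled)

def safe_similarity(distances):
    """Same idea, but treats an all-zero distance matrix as fully similar."""
    max_dist = distances.max()
    if max_dist == 0:
        return pd.DataFrame(np.ones_like(distances))
    return 1 - pd.DataFrame(distances / max_dist)

single = normalized_similarity(np.zeros((1, 1)))  # one image vs. itself
pair = normalized_similarity(np.array([[0.0, 2.0], [2.0, 0.0]]))
guarded = safe_similarity(np.zeros((1, 1)))

print(single)   # the only entry is NaN
print(pair)     # 1 on the diagonal, 0 off the diagonal
print(guarded)  # the only entry is 1.0
```

Note that the per-batch maximum also means scores from different batches are not on a common scale, which is another reason to normalise once over the full matrix.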

To calculate the similarities between all images, you could use a nested loop with two DataLoaders.
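
A minimal sketch of that nested loop, under stated assumptions: random tensors stand in for the 140 page images, a bare `nn.Flatten` stands in for the frozen DenseNet feature extractor, and `torch.cdist` replaces scikit-learn's `euclidean_distances` so everything stays in PyTorch. Both loaders use `shuffle=False` so that a batch position maps to a fixed image index.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
num_images, batch_size = 7, 3
images = torch.randn(num_images, 3, 8, 8)        # stand-in for the page images
dataset = TensorDataset(images)

extractor = nn.Sequential(nn.Flatten()).eval()   # stand-in feature extractor

loader_a = DataLoader(dataset, batch_size=batch_size, shuffle=False)
loader_b = DataLoader(dataset, batch_size=batch_size, shuffle=False)

distances = torch.zeros(num_images, num_images)
with torch.no_grad():
    for i, (batch_a,) in enumerate(loader_a):
        feats_a = extractor(batch_a)
        rows = slice(i * batch_size, i * batch_size + feats_a.size(0))
        for j, (batch_b,) in enumerate(loader_b):
            feats_b = extractor(batch_b)
            cols = slice(j * batch_size, j * batch_size + feats_b.size(0))
            # fill the block of the full matrix for this pair of batches
            distances[rows, cols] = torch.cdist(feats_a, feats_b)

# normalise once over the full matrix, not per batch
similarity = 1 - distances / distances.max()
print(similarity.shape)
```

If the features for all images fit in memory, running the extractor once per batch, concatenating the results with `torch.cat`, and calling `torch.cdist` on the stacked features should give the same matrix while avoiding the repeated forward passes of the inner loop.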