Extracting features from a pre-trained CNN, but getting the same values for every frame

Hi, I’m new to PyTorch. I tried to use an inner layer of ResNet-101 and also VGG (both pretrained) as features for images. But when I took a look at the features, the outputs from ResNet all have nearly the same values, and the same goes for the outputs from VGG…

Does anyone know what the problem is? Thank you.

What is your use case and when do the features look the same?
Also how did you visualize them?

I’m trying to extract features for frames in videos. I first split the video into frames at 32 fps, and then I use the following code to extract the pool5 layer as features (saved to .pt).

import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

def Preprocess(input_image):
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    input_tensor = transform(input_image)
    batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model
    return batch

input_image = Image.open(inputFile)
input_batch = Preprocess(input_image)
model = models.resnet101(pretrained=True)

# drop the final fc layer, keeping everything up to the global average pool
res101_conv = nn.Sequential(*list(model.children())[:-1])

for param in res101_conv.parameters():
    param.requires_grad = False
outputs = res101_conv(input_batch)
result = outputs[0].flatten()

And when I check the output for each frame, frame 1’s feature is similar to frame 1000’s. I used cosine similarity to check; this is the code.

cosine = nn.CosineSimilarity(dim=1, eps=1e-6)
print(cosine(frame_one.view(1, -1), frame_two.view(1, -1)))

What are these frames showing, what is the prediction, and which features do you expect?
Is the prediction wrong for both frames or why are you concerned?

Hi, I extract these features just to represent a video (I want to use these features to analyze videos), not to classify objects or images. But the cosine similarity of the extracted features is above 0.9. I think it’s weird: the video has many different frames and each frame has different content, so the extracted features shouldn’t be that similar. Or am I wrong about this?

It depends on what the input frames are showing and what the model predicts.
E.g. if the model predicts the same class for all input frames, I would expect the extracted features to be quite similar.
Depending on your current use case and input, the model might predict a single random class or just a completely random output, since it wasn’t trained on your dataset.

Did you check what the model predicts for these frames? If the prediction is equally wrong for all of them, the model isn’t able to extract useful features from your frames.
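
Something along these lines would work as a quick check (just a rough sketch; I’m assuming frame_one_batch and frame_two_batch are the preprocessed 1x3x224x224 tensors you already pass to res101_conv):

import torch
import torchvision.models as models

# Rough sketch: compare the top-1 ImageNet class index for two frames.
# frame_one_batch / frame_two_batch are assumed to be the preprocessed
# 1x3x224x224 tensors from your Preprocess function.
model = models.resnet101(pretrained=True)
model.eval()  # inference behavior for the batchnorm layers

with torch.no_grad():
    pred_one = model(frame_one_batch).argmax(dim=1)
    pred_two = model(frame_two_batch).argmax(dim=1)

print(pred_one.item(), pred_two.item())
# If both frames collapse to the same (possibly meaningless) class,
# nearly identical pooled features would be less surprising.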

Sorry for the late reply.

Actually, the frames are extracted from a random video, and there are no object labels for the individual frames. I just tried to use the deep features to see whether I can measure the similarity between frames and find shot boundaries in the video. Just like you said, the model was not trained on my data, so maybe the pre-trained ResNet is not suitable for my case. Thank you so much for helping me figure out the reason.

This might be a possible reason, but you could still run some sanity checks with your current video data.
E.g. if your video contains some frames with ImageNet objects, such as dogs, I would recommend trying to classify them using your current code. If the model fails completely, you might have a bug, e.g. in the preprocessing.
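
As a minimal sketch of such a sanity check (assuming "dog_frame.jpg" is a placeholder for one of your frames showing a dog, and reusing your Preprocess function from above):

import torch
import torchvision.models as models
from PIL import Image

# Sanity check sketch: classify a frame containing an ImageNet object
# with the same preprocessing used for feature extraction.
model = models.resnet101(pretrained=True)
model.eval()

input_batch = Preprocess(Image.open("dog_frame.jpg"))  # placeholder path

with torch.no_grad():
    probs = torch.softmax(model(input_batch), dim=1)
    top5 = probs.topk(5, dim=1)

print(top5.indices, top5.values)
# For a clear dog image, the top-5 should contain a dog class
# (roughly indices 151-268 in ImageNet). Completely random outputs
# for obvious ImageNet objects would point to a preprocessing bug.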

Ok I’ll try! Thank you a lot!