So I have a self-supervised Siamese network for which I have saved the train and test feature vectors for each input. The input is a pair of human tracks (cropped bounding-box regions from a video), and the output is their interaction label, 1 or 0.
Now I want to run a downstream evaluation task for human interaction recognition. How should I go about this when each sample is a pair of inputs? I just want to train an MLP on the saved features that learns to classify the interaction classes in a fully supervised setting.
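To make the question concrete, here is a rough sketch of the kind of setup I have in mind. It is only one possible approach, not something I know to be correct. It assumes each track has a saved feature vector of the same dimension, and it merges the two vectors with an order-invariant combination (absolute difference and element-wise product), which is my own guess at a reasonable choice. The function names `pair_features`, `train_mlp` and `predict_proba` are hypothetical, and the random arrays are placeholders for the saved features and labels.

```python
import numpy as np

def pair_features(feat_a, feat_b):
    """Combine two (N, D) embeddings into one (N, 2D) input.
    |a - b| and a * b are symmetric, so swapping the tracks changes nothing.
    (Assumed combination; plain concatenation is another option.)"""
    return np.concatenate([np.abs(feat_a - feat_b), feat_a * feat_b], axis=1)

def sigmoid(z):
    # Clip to avoid overflow warnings in exp for extreme logits.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50.0, 50.0)))

def train_mlp(x, y, hidden=32, lr=0.1, epochs=200, seed=0):
    """One-hidden-layer MLP (ReLU + sigmoid output) trained with
    full-batch gradient descent on binary cross-entropy."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.normal(0.0, np.sqrt(2.0 / d), size=(d, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, np.sqrt(1.0 / hidden), size=(hidden, 1))
    b2 = np.zeros(1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    for _ in range(epochs):
        # Forward pass.
        h_pre = x @ w1 + b1
        h = np.maximum(h_pre, 0.0)
        p = sigmoid(h @ w2 + b2)
        # Backward pass: gradient of mean BCE w.r.t. the logits is (p - y) / n.
        g_logit = (p - y) / n
        g_w2 = h.T @ g_logit
        g_b2 = g_logit.sum(axis=0)
        g_h = g_logit @ w2.T
        g_h[h_pre <= 0] = 0.0  # ReLU gradient
        g_w1 = x.T @ g_h
        g_b1 = g_h.sum(axis=0)
        # Gradient descent step.
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return w1, b1, w2, b2

def predict_proba(params, x):
    """Return the predicted probability of label 1 for each pair."""
    w1, b1, w2, b2 = params
    h = np.maximum(x @ w1 + b1, 0.0)
    return sigmoid(h @ w2 + b2).ravel()

# Placeholder data standing in for the saved Siamese features and labels.
rng = np.random.default_rng(0)
train_a = rng.normal(size=(64, 16))   # features of the first track in each pair
train_b = rng.normal(size=(64, 16))   # features of the second track in each pair
train_y = rng.integers(0, 2, size=64) # interaction labels (0 or 1)

params = train_mlp(pair_features(train_a, train_b), train_y)
probs = predict_proba(params, pair_features(train_a, train_b))
preds = (probs >= 0.5).astype(int)
```

In practice I would fit this on the saved train features and report accuracy on the saved test features. The part I am unsure about is whether combining the two vectors this way is sensible, or whether the pair should be handled differently.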