Variable-Length Image Sequences and Metadata

Hello everyone,
I’m currently reading a paper and implementing it myself, since the authors haven’t released their code. The task is melanoma recognition, and I’m building a classification system that operates on a per-patient basis, so the samples in a batch have different sequence lengths. For example, with a batch size of 4, each of the four samples can have a different length. The inputs are not only images but also patient metadata. To feed a batch into the transformer encoder, I pad the sequences to a common length and pass the corresponding padding mask.
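To make the setup concrete, here is a minimal PyTorch sketch of the padding and masking step as I understand it. The embedding size, sequence lengths, and layer sizes are placeholder assumptions, not values from the paper, and each row of a sequence stands in for an already fused image-plus-metadata embedding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

torch.manual_seed(0)

d_model = 32             # assumed embedding size (placeholder)
lengths = [5, 3, 7, 2]   # assumed sequence length per patient, batch size 4

# One (length, d_model) tensor per patient; random values stand in for real features
sequences = [torch.randn(n, d_model) for n in lengths]

# Pad to the longest sequence in the batch -> (batch, max_len, d_model)
padded = pad_sequence(sequences, batch_first=True)

# True marks a padded position, the convention src_key_padding_mask expects
lengths_t = torch.tensor(lengths)
padding_mask = torch.arange(padded.size(1))[None, :] >= lengths_t[:, None]

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=False).eval()

with torch.no_grad():
    encoded = encoder(padded, src_key_padding_mask=padding_mask)
```

The mask only stops real positions from attending to padded ones. The encoder output still has the padded shape (batch, max_len, d_model), so the padded rows are still present in the output.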

My question is about the classification step: I want to classify using only the original, unpadded sequence lengths, which vary within the batch. Is there a way to achieve this? Any help would be greatly appreciated.
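Two options I can think of are sketched below. I am not sure either matches what the paper does. All sizes and labels are placeholders, and a random tensor stands in for the encoder output. Option A classifies every real element and drops padded positions before the loss. Option B averages over the real positions only and gives one prediction per patient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

batch, max_len, d_model, num_classes = 4, 7, 32, 2   # placeholder sizes
lengths = torch.tensor([5, 3, 7, 2])                  # original sequence lengths

encoded = torch.randn(batch, max_len, d_model)        # stand-in for encoder output
valid = torch.arange(max_len)[None, :] < lengths[:, None]   # True = real position

head = nn.Linear(d_model, num_classes)

# Option A: one prediction per real element; padded positions never reach the loss
token_logits = head(encoded)                                      # (batch, max_len, num_classes)
token_labels = torch.randint(0, num_classes, (batch, max_len))    # dummy labels
loss_a = F.cross_entropy(token_logits[valid], token_labels[valid])

# Option B: one prediction per patient, mean over real positions only
summed = (encoded * valid[..., None]).sum(dim=1)
pooled = summed / lengths[:, None]                    # (batch, d_model)
patient_logits = head(pooled)                         # (batch, num_classes)
```

In Option A, indexing with the boolean mask flattens the batch to the 17 real elements, so the loss is computed on the original lengths. In Option B, dividing by the true length rather than max_len keeps the padding from diluting the average.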

Here is the GitHub repository I created: GitHub Repo