Best way to feed 2D tensors of varying size, from 100x512 to 8000x512 with a median size of 1200x512, to a vision transformer that requires all inputs in the same batch to have the same dimensions

I have input images that each consist of anywhere from 100 to 8000 512x512-pixel squares. I feed these 512x512 patches into a resnet18 pretrained on ImageNet and get back a 1D tensor of length 512 for each patch. I then concatenate these 1D tensors into an Nx512 tensor (in an offline fashion).
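For reference, this is roughly how that offline extraction step looks; a minimal sketch, assuming torchvision >= 0.13, with the patch-cutting omitted and `patches` / `image_to_features` being placeholder names I made up:

    import torch
    import torchvision.models as models

    # Pretrained resnet18 with the final fc layer replaced, so the output is the
    # 512-dimensional feature vector rather than the 1000-class ImageNet logits.
    extractor = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    extractor.fc = torch.nn.Identity()
    extractor.eval()

    @torch.no_grad()
    def image_to_features(patches):   # patches: (N, 3, 512, 512)
        return extractor(patches)     # -> (N, 512)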

So each image ends up as an Nx512 tensor, where N ranges from 100 to 8000 with a median of 1200 patches per image.

Does it even make sense to use something like the approach below to pass these intermediate representations to the vision transformer (whose embedding and positional encoding I have dropped), given that it requires all tensors in the same batch to have the same dimensions? How else could I handle this problem?

What the code below does is check whether the first dimension (the number of patches in an image) is smaller than the median number of patches per image across all images (1200 here); if so, it zero-pads the Nx512 tensor up to 1200x512. If the intermediate representation instead has more than 1200 patches, it samples 1200 patches randomly from it. This ensures that inside one batch (e.g. of size 32), all the Nx512 tensors have the same shape of 1200x512.

My main concern is that both of these methods, especially the zero-filling, are aggressive, and I wonder whether they are contributing to the low accuracy reported in Poor predictions for labels in binary classification using cross-entropy loss.

        if features.shape[0] <= 1200:
            # zero-pad along the patch dimension up to length 1200
            padding = torch.zeros((1200 - features.shape[0], 512),
                                  dtype=features.dtype, device=features.device)
            sample['image'] = torch.cat((features, padding), dim=0)
        else:
            # randomly sample 1200 patches (with replacement)
            random_indices = torch.randint(features.shape[0], (1200,))
            sample['image'] = features[random_indices, :]