Feature extraction for similarity learning

I am training a model to extract features from images, and then clustering those features to group the images by visual similarity.
I am using a ResNet-50 with the final classification layer removed and a custom L2 normalisation layer added in its place. The output of the L2 normalisation layer is what I cluster on. The final layers look like this:
-----previous layers-----
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Sequential(
    (0): L2Normalization()
  )
)

Should I add one or more linear layers inside the Sequential container, or should I not use any linear layers at all?
Could someone please comment on which layer I should take the output from, and whether I should use or avoid linear layers and/or activation functions?
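
To make the options concrete, here is a rough sketch of the three head variants in question, applied to a stand-in for the flattened avgpool output. The layer sizes (512 and 128) are placeholders chosen only for illustration, and L2Normalization is again assumed to wrap torch.nn.functional.normalize. Note that any newly added linear layer starts from random weights, so it would only produce meaningful features after training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class L2Normalization(nn.Module):
    """Assumed implementation: scale each feature vector to unit L2 norm."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(x, p=2, dim=1)


# Stand-in for the flattened avgpool output of ResNet-50 (batch of 4).
pooled = torch.randn(4, 2048)

heads = {
    # Current setup: no linear layer, cluster on the pooled features directly.
    "norm_only": nn.Sequential(L2Normalization()),
    # One linear projection to a smaller embedding (128 is a placeholder size).
    "linear": nn.Sequential(nn.Linear(2048, 128), L2Normalization()),
    # Two linear layers with an activation in between (sizes are placeholders).
    "mlp": nn.Sequential(
        nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 128), L2Normalization()
    ),
}

outputs = {}
with torch.no_grad():
    for name, head in heads.items():
        outputs[name] = head(pooled)
        print(name, tuple(outputs[name].shape))
```

In all three cases the output rows have unit length; what changes is the embedding dimension and whether extra trainable parameters sit between the backbone and the clustering step.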