How to combine pre-trained weights of components from different multimodal LLMs?

Hi everyone,

I’m currently doing some research on multimodal LLMs, and as you know, an MLLM combines multiple models covering vision, text, speech, etc. Most MLLMs have vision/audio encoder(s) to extract features from images, videos, and audio, plus some connection modules that adapt those features to the LLM’s embedding space. Usually only the connection modules are trained, with the pre-trained encoders and the LLM kept frozen.
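Just to make concrete what I mean by “connection module”, this is roughly the layout I have in mind (a minimal sketch; all class and attribute names are made up, not taken from any particular model):

```python
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    """Frozen vision encoder + trainable connector + frozen LLM (illustrative only)."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.connector = nn.Linear(vision_dim, llm_dim)  # typically the only trained part
        self.llm = llm
        # Freeze everything except the connector.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def encode_image(self, pixels: torch.Tensor) -> torch.Tensor:
        feats = self.vision_encoder(pixels)   # (batch, seq, vision_dim)
        return self.connector(feats)          # (batch, seq, llm_dim), ready for the LLM
```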

I’m trying to combine some components (e.g. vision encoders) from one MLLM with the architecture of another MLLM, and both MLLMs have their weights stored in safetensors files in Hugging Face repos. So my first question is: is there a way to inspect those safetensors files to find out which sets of weights correspond to which components of the MLLM? And a possibly harder question: can we take part of the weights from one MLLM’s safetensors (e.g. only the vision encoder’s weights) and combine them with the weights in the other MLLM’s safetensors?
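For the inspection part, this is roughly what I have in mind (a sketch only; the repo id, filename, and the `vision_tower.` prefix are placeholders I made up, since the actual key names depend on the model):

```python
from huggingface_hub import hf_hub_download
from safetensors import safe_open
from safetensors.torch import save_file

# 1) Inspect: list the tensor names (and shapes) stored in one safetensors shard.
shard = hf_hub_download("some-org/source-mllm", "model.safetensors")  # hypothetical repo/file
with safe_open(shard, framework="pt", device="cpu") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())

# 2) Extract: keep only the tensors whose names start with the vision encoder's
#    prefix (whatever that prefix actually is in the source model).
vision_weights = {}
with safe_open(shard, framework="pt", device="cpu") as f:
    for name in f.keys():
        if name.startswith("vision_tower."):
            vision_weights[name] = f.get_tensor(name)

save_file(vision_weights, "vision_encoder_only.safetensors")
```

What I’m less sure about is the merging step: presumably the extracted keys would need to be renamed to match the target architecture’s parameter names before something like `model.load_state_dict(renamed_weights, strict=False)` could work, and the tensor shapes would have to line up. Is that the right approach, or is there a better-supported way?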

Thanks in advance.