Splitting model layers across GPUs

Hey Guys,
I want to run an LLM as efficiently and quickly as possible.

I've already tried splitting the layers across my 2 GPUs (Tesla M40s), but what I actually want is to split each layer in half, so that for every layer one GPU gets one half and the other GPU gets the other half, and they sync after every layer. There's a small sketch of what I mean below.
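
Here's a minimal sketch of the kind of split I'm picturing, assuming a plain `nn.Linear` layer with its output columns divided between `cuda:0` and `cuda:1` (the class and names are just for illustration, not from any library):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColumnSplitLinear(nn.Module):
    """Hypothetical example: split one Linear layer's output columns across two GPUs."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        half = linear.out_features // 2
        # First half of the output rows of the weight lives on cuda:0, second half on cuda:1
        self.w0 = nn.Parameter(linear.weight[:half].detach().to("cuda:0"))
        self.b0 = nn.Parameter(linear.bias[:half].detach().to("cuda:0"))
        self.w1 = nn.Parameter(linear.weight[half:].detach().to("cuda:1"))
        self.b1 = nn.Parameter(linear.bias[half:].detach().to("cuda:1"))

    def forward(self, x):
        # Each GPU computes its half of the output features
        y0 = F.linear(x.to("cuda:0"), self.w0, self.b0)
        y1 = F.linear(x.to("cuda:1"), self.w1, self.b1)
        # "Sync" step: gather both halves back onto one device before the next layer
        return torch.cat([y0, y1.to("cuda:0")], dim=-1)
```

Basically that, but for every layer of the model instead of one hand-written Linear.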

From my research, FSDP sounds like a good fit, but does anyone have experience using it without a backward pass? I don't want to train the model yet, I only want to run it and answer questions. Roughly what I imagine is shown below.
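
For reference, this is roughly how I imagine the inference-only use: just wrap the model in FSDP and run the forward under `torch.no_grad()`, with no optimizer or backward. I'm assuming the process group is already initialized (e.g. launched with `torchrun`, one process per GPU), and `model` / `input_ids` are placeholders:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed is already initialized (torchrun, 2 processes, 1 GPU each)
# and `model` is the loaded LLM on this rank's device.
model = FSDP(model.cuda())
model.eval()

with torch.no_grad():          # inference only, no backward pass
    output = model(input_ids)  # FSDP gathers the sharded params per layer for the forward
```

Does that actually work in practice, or does FSDP expect a backward pass at some point?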

Thanks!