How to split a trained model across multiple GPUs (model parallelism, not DataParallel)

I have trained a large image-denoising CNN on 40x40 patches. But when I test on a full image at 512x512 resolution, I get a CUDA runtime error: out of memory. So I think I need to split the model's layers across multiple GPUs and then restore the pretrained parameters onto the split model. I know `.cuda(gpu_num)` can place individual layers on different GPUs, but how do I restore the pretrained parameters in that setup?
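For context, here is a minimal sketch of what I am trying. The model class, checkpoint path, and the head/tail split are placeholder assumptions, not my actual denoising network; the idea is to load the `state_dict` on CPU first (parameter names don't change when sub-modules later move to different devices) and then place each part on its own GPU, moving the intermediate activation by hand in `forward`:

```python
import torch
import torch.nn as nn

class DenoiseCNN(nn.Module):
    """Toy two-part CNN standing in for the real denoising model."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU())
        self.tail = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        x = self.head(x)
        # Move the activation to whatever device the tail lives on.
        x = x.to(next(self.tail.parameters()).device)
        return self.tail(x)

model = DenoiseCNN()

# Stand-in for the real checkpoint: save then reload on CPU.
torch.save(model.state_dict(), "checkpoint.pth")  # hypothetical path
state = torch.load("checkpoint.pth", map_location="cpu")

# Restore parameters BEFORE moving sub-modules to their GPUs;
# load_state_dict matches by name, which device placement does not change.
model.load_state_dict(state)

# Then split the restored model across GPUs (guarded so it also runs on CPU).
if torch.cuda.device_count() >= 2:
    model.head.cuda(0)
    model.tail.cuda(1)

out = model(torch.randn(1, 1, 512, 512))
print(out.shape)
```

Is this the right order of operations, or is there a better way to map a single-GPU checkpoint onto a model whose layers live on different devices?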