How to avoid vanishing gradients in a CNN

I’m trying to build a basic architecture for super-resolution (it could be extended to other image processing tasks), but I found that the network barely trains. The code can be found at https://github.com/psychopa4/Basic-Architecture-for-Super-Resolution

Here is the evaluation output after several epochs of training:

Scale= 8
Dataset= E:/data/DIV2K/sp_val
PSNR_predicted= 23.6633691875079
PSNR_bicubic= 23.679716005992653
It takes average 0.004361677169799805s for processing
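
For reference, the PSNR values above are normally computed from the mean squared error against the HR ground truth. A minimal sketch in NumPy (the peak value of 255 and the lack of border cropping are assumptions here; the repo's evaluation script may differ):

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio between two images of the same shape."""
    pred = pred.astype(np.float64)
    target = target.astype(np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# PSNR_predicted = psnr(sr_output, hr_image)
# PSNR_bicubic   = psnr(bicubic_upsampled, hr_image)
```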

As you can see above, the output of the network is essentially the same as the bicubic upsampling of the low-resolution images, or even slightly worse. I have tried training the network for hundreds of epochs, but the output still tends to converge to the bicubic upsampling result.

I previously built this model in TensorFlow and it worked well, so I wonder whether I have missed something in the PyTorch training process. I suspect vanishing gradients, because the network barely improves after 3 or 4 epochs. I tried torch.nn.utils.clip_grad_norm, but it didn’t help; it is probably more suited to exploding gradients than to vanishing ones.
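
One way to confirm whether the gradients are actually vanishing is to print the gradient norm of each layer right after `loss.backward()`. A minimal sketch (the model, loss, and loop names below are placeholders, not taken from the repo; note that the in-place PyTorch function is `clip_grad_norm_` with a trailing underscore):

```python
import torch

def log_grad_norms(model):
    """Print the L2 norm of every parameter gradient that exists."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f"{name}: {p.grad.norm().item():.3e}")

# Inside the training loop (illustrative names):
# optimizer.zero_grad()
# loss = criterion(model(lr_batch), hr_batch)
# loss.backward()
# log_grad_norms(model)   # near-zero norms in the early layers suggest vanishing gradients
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step()
```

If the early layers show norms several orders of magnitude smaller than the later ones, the problem is likely vanishing gradients rather than a data or loss issue.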

Sorry I didn’t paste the code directly, because I have no idea which part is crucial. I’d really appreciate it if you could take a quick look at this program!

About Super-Resolution
Normally, a super-resolution model takes a low-resolution (LR) image and reconstructs the high-resolution (HR) image.

The output of the network is F(I_lr) + I_bic, and the loss function is L1(HR, out), so the output should converge to HR rather than to the bicubic result.

This ‘+Bicubic’ (residual learning) strategy is widely adopted for super-resolution.
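
To illustrate, here is a minimal PyTorch sketch of the residual strategy; the layer sizes, the use of `F.interpolate` for the bicubic branch, and applying the residual network to the upsampled image (VDSR-style) are assumptions, not the architecture from the repo, which may upsample inside the network instead:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSR(nn.Module):
    """Predicts a residual that is added to the bicubic upsampling I_bic."""
    def __init__(self, scale=8, channels=64):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, lr):
        # I_bic: bicubic upsampling of the LR input
        bic = F.interpolate(lr, scale_factor=self.scale,
                            mode="bicubic", align_corners=False)
        # out = F(.) + I_bic; the network only has to learn the residual detail
        return self.body(bic) + bic

# The L1 loss is then taken against the HR target:
# loss = nn.L1Loss()(ResidualSR()(lr_batch), hr_batch)
```

With this formulation, an untrained (or poorly trained) residual branch outputs something close to zero, so the prediction collapses to the bicubic image, which matches the PSNR numbers reported above.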

How did you deal with this problem? I have met the same problem.