How to debug the backward process of nn.DataParallel

I implemented a memory-efficient DenseNet for PyTorch v0.4.0, and it works fine on a single GPU. However, it fails in the multi-GPU case. The error occurs during the backward pass of the nn.DataParallel model:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

I suspect that in-place modifications of intermediate variables with shared storage cause this error, but I cannot locate the offending in-place operation, since all the in-place operations work fine in the single-GPU case. Does nn.DataParallel do additional gradient checking during the backward pass?
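
For reference, the same RuntimeError can be reproduced in a few lines, completely outside DenseNet (this is a hypothetical minimal sketch, not my actual code): mutating a tensor that autograd saved for the backward pass bumps its version counter, and backward then refuses to use it.

```python
import torch

# Minimal reproducer of the same error, unrelated to DenseNet:
# exp() saves its output for the backward pass, so mutating that
# output in place invalidates the saved tensor.
x = torch.randn(3, requires_grad=True)
y = x.exp()   # autograd saves y to compute exp's gradient
y.add_(1)     # in-place edit bumps y's version counter
y.sum().backward()
# RuntimeError: one of the variables needed for gradient computation
# has been modified by an inplace operation
```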
Besides, I ran a similar implementation with PyTorch v0.3.1 and it works well with nn.DataParallel.
I opened an issue for the project; the code there can be used to reproduce the error.
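
In case it helps others who land here: later PyTorch versions (0.4.1 onward, if I remember correctly) ship an anomaly-detection mode that, when backward fails, also prints the traceback of the forward operation that produced the bad tensor. That is one way to locate the in-place culprit. A sketch, with a toy model standing in for the real DenseNet:

```python
import torch
import torch.nn as nn

# Run forward/backward inside anomaly-detection mode so the error
# report includes the forward-pass traceback of the offending op.
model = nn.DataParallel(nn.Linear(16, 4).cuda())
inputs = torch.randn(8, 16, device="cuda")
targets = torch.randn(8, 4, device="cuda")

with torch.autograd.detect_anomaly():
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()  # failure now points at the responsible forward op
```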

Strangely, I fixed the bug by no longer restoring running_mean and running_var for the backward pass.
An in-place operation like self.running_mean.copy_(self.prev_running_mean) during backward does not pass in the nn.DataParallel case.
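
To make the contrast concrete, here is a sketch of the buffer handling (the class and method names are my own placeholders, not the project's actual API): the in-place copy_ during backward is what triggers the error under nn.DataParallel, while skipping the restore, or rebinding the buffer to a fresh tensor, avoids mutating anything autograd may have saved.

```python
import torch.nn as nn

class CheckpointedBN(nn.BatchNorm2d):
    # Hypothetical sketch of the pattern described above,
    # not the project's actual code.

    def save_stats(self):
        # Snapshot running statistics before the recomputation pass.
        self.prev_running_mean = self.running_mean.clone()
        self.prev_running_var = self.running_var.clone()

    def restore_stats(self):
        # Original approach: in-place restore during backward.
        # This is the operation that fails under nn.DataParallel.
        self.running_mean.copy_(self.prev_running_mean)
        self.running_var.copy_(self.prev_running_var)

    def restore_stats_fixed(self):
        # Possible workaround: rebind the buffers to fresh tensors
        # instead of mutating the stored ones, so any tensor autograd
        # saved keeps its original contents. (In my case, dropping the
        # restore entirely was enough.)
        self.running_mean = self.prev_running_mean.clone()
        self.running_var = self.prev_running_var.clone()
```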