Help debugging DenseNet model on CIFAR-10

bamos · February 9, 2017, 3:42pm

Hi PyTorch community,

I strongly dislike asking for help on things like a model not converging, but I have implemented a DenseNet model in PyTorch and do not know how to further debug why it’s not working. It’s very likely that I’ve overlooked something simple, but I’m starting to think there might be something deeper going on with PyTorch. I’ve been checking gradients and my training procedure and everything I can think to check looks surprisingly good. Can somebody that’s familiar with PyTorch or training DenseNets take a quick look at my project so far and let me know if anything looks wrong or if there’s anything else I should check? I have posted it to GitHub at https://github.com/bamos/densenet.pytorch along with some more details about what I’ve checked and instructions on running the code.

-Brandon.

apaszke · February 9, 2017, 4:07pm

One thing is that this:

for param_group in optimizer.state_dict()['param_groups']:

should be replaced with that:

for param_group in optimizer.param_groups:

I know that the first version appears in the ImageNet example, but it no longer works as expected. I can’t see anything wrong at a glance, but I’ll try to look more carefully into it sometime.
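To illustrate the difference, here is a minimal sketch assuming a toy `nn.Linear` model and a hypothetical helper named `set_lr` (neither comes from the repository). In current PyTorch releases, `state_dict()` returns copies of the parameter-group dictionaries, so writing to them leaves the optimizer unchanged, whereas `optimizer.param_groups` is the live list.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)  # toy model, for illustration only
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Writing to the state_dict copy does not reach the optimizer.
for param_group in optimizer.state_dict()['param_groups']:
    param_group['lr'] = 123.0
lr_after_state_dict_write = optimizer.param_groups[0]['lr']


def set_lr(optimizer, lr):
    """Hypothetical helper: write the new rate into the live param groups."""
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr


set_lr(optimizer, 0.01)
lr_after_param_groups_write = optimizer.param_groups[0]['lr']
```

A learning-rate schedule that uses the first loop would silently keep training at the initial rate, which is worth ruling out before looking for deeper problems.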

bamos · February 9, 2017, 4:25pm

Thanks for looking so quickly! I fixed that. (It doesn’t sound like you expected this to help the convergence, and indeed the model is still not converging, just as before.)

bamos · February 9, 2017, 4:41pm

Thanks!

My training code does successfully train a known model (VGGnet) on CIFAR-10, so my logic is also that there’s something wrong with my model. However, I have also compared my model’s outputs and gradients exactly against those of the official model.

I thought about trying to remove the passthrough connections from my DenseNet implementation to further debug this. However, I haven’t tried this yet because I’ve never seen reports of such an architecture (even correctly implemented) converging on CIFAR-10. And if my implementation of this did converge, then it would indicate that there’s a problem with layers that concatenate the input and output. So to directly check if there’s a problem with this kind of operation, I used numdifftools to numerically check the gradients of a single PyTorch layer that concatenated the input to a fully-connected operation.

As another idea of breaking the DenseNet into a known architecture, I could start with a ResNet architecture that’s known to converge and then start adding DenseNet features. However, these intermediate architectures are not known to converge, so if it doesn’t work, then I won’t know if it’s because of a code bug or something more fundamental.
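A check of this kind can be sketched as follows. This is not the original numdifftools script: it swaps in `torch.autograd.gradcheck`, which compares analytic gradients against finite differences, and the `ConcatLinear` layer and its sizes are hypothetical stand-ins.

```python
import torch
import torch.nn as nn
from torch.autograd import gradcheck

torch.manual_seed(0)


class ConcatLinear(nn.Module):
    """Hypothetical layer: concatenates its input with a linear map of it."""

    def __init__(self, n_in, n_out):
        super().__init__()
        self.fc = nn.Linear(n_in, n_out)

    def forward(self, x):
        # DenseNet-style passthrough: keep the input next to the new features.
        return torch.cat((x, self.fc(x)), dim=1)


# gradcheck needs double precision for the finite differences to be reliable.
layer = ConcatLinear(3, 2).double()
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)

grad_ok = gradcheck(layer, (x,), eps=1e-6, atol=1e-4)
out_shape = tuple(layer(x).shape)
```

Note that a check like this runs on the CPU with small tensors, so it exercises the autograd logic but not a GPU backend such as cuDNN.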

bamos · February 9, 2017, 7:44pm

Adam’s been helping me debug this over Slack today and we’ve solved it! We found a new PyTorch bug with cudnn that comes up with DenseNet-style layers. After fixing this, my DenseNet model converges much better than before, and I’ll update my repo with the current results shortly. My understanding is that Adam will push a short patch to PyTorch master soon.

Thanks again for the help, Adam.

-Brandon.

apaszke · February 9, 2017, 11:26pm

The fix is already in master.

ClementPinard · February 10, 2017, 10:24am

What was the problem? I’ve been having some convergence issues with my network, which has some layer concatenation just like yours. My problems are not necessarily related to yours, but I’m glad you could figure it out with Adam!

apaszke · February 10, 2017, 2:16pm

The problem arose from concatenating outputs of convolutions along the second dimension. This led to calling conv’s backward with non-contiguous gradients, and we were overly smart about reusing cuDNN descriptors, so the backend assumed the gradients were contiguous. The fix is to either disable cuDNN or rebuild PyTorch from the current master.

I’m sorry if that also affected your network.
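For readers wondering why the gradients end up non-contiguous, here is a small sketch. It assumes that the backward of a concatenation hands each input its own slice of the incoming gradient, and it mimics that slicing with `narrow`; the tensor shapes are illustrative only. The last line shows the "disable cuDNN" workaround.

```python
import torch

# Stand-in for the gradient flowing into a concatenation of two conv
# outputs with 3 channels each (shapes are illustrative assumptions).
grad_output = torch.randn(2, 6, 4, 4)

# Slicing along dim 1 gives each branch a view into the same memory.
grad_first = grad_output.narrow(1, 0, 3)

sliced_is_contiguous = grad_first.is_contiguous()    # False
sliced_strides = grad_first.stride()                 # (96, 16, 4, 1)
copied_strides = grad_first.contiguous().stride()    # (48, 16, 4, 1)

# Workaround from the thread: turn cuDNN off for the whole process.
torch.backends.cudnn.enabled = False
```

The slice keeps the batch stride of the full 6-channel tensor (96) instead of the 48 that a packed 3-channel tensor would have, so any backend that assumes a packed layout reads the wrong elements.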

fmassa · March 3, 2017, 11:51am

Some tensor operations, like transpose, do not perform a memory copy and only change the strides of the tensor (pixel shuffle is one of these cases).
tensor.contiguous() will make sure that the data is contiguous in memory, and if it is not, will perform a copy so that it becomes contiguous. This is necessary for some functions to work (for example, .view only accepts contiguous tensors).
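A short sketch of this behaviour on a small tensor (the values are chosen only for illustration). Newer PyTorch releases let `.view` accept some non-contiguous layouts, but a transposed matrix is still rejected.

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.t()  # no copy: same storage, swapped strides

shares_memory = t.data_ptr() == x.data_ptr()   # True
x_strides, t_strides = x.stride(), t.stride()  # (3, 1) and (1, 3)
t_is_contiguous = t.is_contiguous()            # False

# Flattening the transposed tensor with .view fails ...
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# ... but works once the data has been copied into contiguous memory.
flat = t.contiguous().view(6)  # tensor([0, 3, 1, 4, 2, 5])
```

Calling `.contiguous()` on a tensor that is already contiguous returns it without copying, so it is cheap to call defensively before `.view`.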