0.3 _cudnn_convolution_ deterministic flag

Danlu_Chan · December 28, 2017, 12:56am

Hi,

I am maintaining the efficient DenseNet codebase. I found that PyTorch 0.3 changed the API a little bit: for _cudnn_convolution_*, we now have to pass a deterministic flag to the conv op. I made the changes accordingly. See this branch. It is runnable under 0.3, but the result is wrong (error rate ~= 0.9 all the time).

Checking out the pytorch0.3 branch and running CUDA_VISIBLE_DEVICES=0 python demo.py --efficient True --data ./data should reproduce the problem.

Can anyone help on this issue?

Thanks,
Danlu

ezyang · January 4, 2018, 10:19pm

I am going to take a look.

ezyang · January 4, 2018, 11:45pm

What’s the last known good version of PyTorch for which CUDA_VISIBLE_DEVICES=0 python demo.py --efficient True --data ./data on master converges, and how quickly does it converge? It is possible I have messed up but master doesn’t seem to converge for me on 0.2 (e02f7bf8a344b04f51527f233bd4727b6ebb1ebe) either.

Danlu_Chan · January 5, 2018, 12:08am

Thanks for your help!

PyTorch 0.1.12 should pass all the tests.

According to my experiment, PyTorch 0.2 works well in terms of the final accuracy, but it does not pass one of the tests.

ezyang · January 5, 2018, 4:57pm

That’s interesting. I’ve run 167 epochs of the 1 GPU test on PyTorch 0.2, but the error is still at 0.896. (Do you mean something different in terms of final accuracy?) I did confirm that it converges on 0.1.12. So it sounds like the behavior changed at some point between 0.1 and 0.2.

Danlu_Chan · January 5, 2018, 5:35pm

I also believe something changed between 0.1.12 and 0.2, since we cannot pass all the tests after upgrading to PyTorch 0.2. And I think you’re right. I just created a new conda environment and installed 0.2 from scratch. It did not converge. My bad!

However, I asked Trevor Killeen about convnd API between 0.1.12 and 0.2 and he replied that:

I don’t think that the CuDNN calls should have changed but I don’t have as much context to the switch from running Conv in Python to the C++ autograd implementation. Someone in one of the above channels can probably provide better help.

ezyang · January 8, 2018, 4:08pm

Yeah, I’m not too familiar with the 0.2 release so I’ll have to do some digging. Would be worth figuring this out though.

nikostr · January 22, 2018, 1:06pm

Has there been any progress on this? Any recommendation on where to start looking to hunt down this bug?

Danlu_Chan · January 22, 2018, 9:00pm

Any updates? I also have time to investigate now.

ezyang · January 23, 2018, 6:48pm

Sorry, I haven’t had a chance to attempt a bisect between 0.1 and 0.2 to see what might have changed. An easier thing to check that might be illuminating is to see if the problem repros (1) with cuDNN turned off (torch.backends.cudnn.enabled = False) and (2) with CPU.
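For readers who want to try the bisect mentioned above, here is a minimal sketch of the search itself. It is not tied to PyTorch: the commit names and the `converges` predicate are hypothetical stand-ins. In a real run the predicate would build PyTorch at the given commit and check whether demo.py converges, which is the expensive part. The sketch assumes the history is ordered oldest to newest, the first commit is good, the last is bad, and the behavior never recovers once it breaks.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_good() is False.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Hypothetical history; "c3" plays the role of the commit that broke convergence.
history = ["c0", "c1", "c2", "c3", "c4", "c5"]
regressed_at = "c3"


def converges(commit: str) -> bool:
    # Stand-in for "build PyTorch at this commit and run the demo".
    return history.index(commit) < history.index(regressed_at)


culprit = first_bad(history, converges)
print(culprit)  # c3
```

Each step halves the remaining range, so only about log2(N) builds are needed for N commits.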

MatthewKleinsmith · February 3, 2018, 2:00am

torch.backends.cudnn.enabled = True:

Stuck at 0.89; 80 epochs.

torch.backends.cudnn.enabled = False:

“Exception: You must be using CUDNN to use _EfficientBatchNorm”
cudnn calls are made throughout densenet_efficient.py

Environment:

nvidia-docker, Ubuntu 14.04, CUDA 8.0, cuDNN 6, PyTorch 0.2.0

Details:

Base environment:

docker run --runtime=nvidia --init -it --rm --ipc=host mwksmith/efficient_densenet_pytorch

Here’s the Dockerfile. I uploaded the Docker image to Docker Hub at mwksmith/efficient_densenet_pytorch.

Environment for cudnn = True:
Titan X Pascal, Driver Version: 384.111

Command:
CUDA_VISIBLE_DEVICES=1 python demo.py

Environment for cudnn = False:

1080 Ti, Driver Version: 387.34, different machine

I put torch.backends.cudnn.enabled = False after the import statements in demo.py.

Command:

CUDA_VISIBLE_DEVICES=0 python demo.py

@Danlu_Chan, @nikostr, @ezyang

ezyang · February 15, 2018, 7:53am

I still plan on looking into this, but at this point I am seriously considering just reimplementing efficient DenseNet from scratch on PyTorch HEAD and seeing if it works or not (major porting seems necessary, since a lot of the internal APIs DenseNet is using have been moved around and are no longer available).

Danlu_Chan · February 16, 2018, 2:30am

Thanks for your hard work!
I am curious how you would implement this. Roughly, I have one question: no matter what kind of approach you’re going to implement, I guess you would need to manually allocate the memory for several ops. I am afraid this really violates the design of PyTorch…

ezyang · February 16, 2018, 7:33am

Yeah. My thinking is that we might be able to do it if we expose enough out= variants of operators.
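As a rough illustration of what an out= variant looks like from user code, the sketch below writes a concatenation into a buffer that was allocated once up front, using the out= argument of torch.cat. This is only an assumption about the general idea, not the efficient DenseNet implementation or the plan discussed here; the tensor shapes and variable names are made up. It runs on CPU and covers the forward pass only, since out= variants generally do not take part in autograd.

```python
import torch

# Two feature maps to concatenate along the channel-like dimension.
a = torch.randn(4, 3)
b = torch.randn(4, 5)

# Preallocated buffer with the shape of the result (hypothetical shared storage).
buf = torch.empty(4, 8)
ptr_before = buf.data_ptr()

# Write the concatenation into the existing buffer instead of a new tensor.
torch.cat([a, b], dim=1, out=buf)

# The buffer keeps its storage and now holds the concatenated values.
same_storage = buf.data_ptr() == ptr_before
matches = torch.equal(buf, torch.cat([a, b], dim=1))
print(same_storage, matches)
```

Because the buffer already has the right shape, it can be reused for the next call in the same way, which is the memory-saving behavior the out= approach is after.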

Audrius-St · February 26, 2018, 6:31pm

Very interested in the outcome of this effort.
Currently using “conventional” DenseNet and would like to increase model size.
Thank you all involved for your work.