What happens at 33 filters in Conv2d?

You can disregard the CPU traces.

This is a plot of runtime vs. number of filters (n_bins) and kernel width (max_sigma) in a Conv2d layer. What happens at 33 filters? Obviously a different convolution strategy is being chosen for some optimization reason, but does anyone know exactly what the switch is and where it happens in the code?
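For reference, here is a minimal sketch (not the original benchmark code) of how such a timing sweep could look; the input shape, kernel size and filter range are illustrative only:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda")
x = torch.randn(1, 1, 256, 256, device=device)

for n_filters in range(30, 37):
    conv = nn.Conv2d(1, n_filters, kernel_size=11, padding=5).to(device)
    # Warm-up so one-time setup costs are not measured.
    for _ in range(3):
        conv(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        conv(x)
    torch.cuda.synchronize()
    print(n_filters, (time.time() - start) / 100)
```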


Hi,

I would say this is due to cudnn internals.

  • If you disable cudnn, only our native implementation will be used and you should not see this jump
  • cudnn has its own custom algorithms, chosen depending on the input size
  • You can enable cudnn benchmark mode (torch.backends.cudnn.benchmark = True) and check whether the behavior improves; it should pick the best algorithm for your input sizes and so remove such artifacts (see the sketch after this list)
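A minimal sketch of these switches; the flag names are the actual torch.backends.cudnn attributes, the rest is illustrative:

```python
import torch

# Disable cudnn entirely: convolutions fall back to PyTorch's native kernels,
# so a cudnn-internal algorithm switch should no longer show up.
torch.backends.cudnn.enabled = False

# Or keep cudnn enabled and let it benchmark the available algorithms for the
# current input shapes, caching the fastest one per shape.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
```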

I would say this is due to cudnn internals.

I’m guessing there’s no way to find out what cudnn is doing because it’s closed source? Or is there a way to get nvprof (or something like that) to tell me which conv strategy it’s using?

I think there is a verbose mode for cudnn that gives you more info. @ptrblck ?

Hopefully benchmark mode will remove these hiccups that we can’t control :slight_smile:

Thanks @albanD. I’m actually not that worried about this; I’m just interested to know which choices are made (and maybe why they’re made).

Yes, you can use nvprof to capture the cudnn calls and check which kernel is called for your current workload.


Yes, you can use nvprof to find out which convolution algorithm was used.
On your GPU machine: nvprof -o prof.nvvp python train_mnist.py
Then copy prof.nvvp to your local machine and open it with: nvvp prof.nvvp
More details can be found here: https://gist.github.com/sonots/5abc0bccec2010ac69ff74788b265086
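One way to make the convolution easier to spot in the nvprof/nvvp timeline is to wrap the forward pass in NVTX ranges via torch.autograd.profiler.emit_nvtx(). This is only a sketch with placeholder shapes, meant to be run under the nvprof command above:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
conv = nn.Conv2d(1, 33, kernel_size=11, padding=5).to(device)
x = torch.randn(1, 1, 256, 256, device=device)

# Emit NVTX markers so the conv shows up as a named range in the timeline,
# right next to the cudnn kernel that was actually launched for it.
with torch.autograd.profiler.emit_nvtx():
    out = conv(x)
torch.cuda.synchronize()
```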

Indeed, it seems like a change in the convolution algorithm. By the way, as far as I know you cannot select or check which convolution algorithm is used in PyTorch. Could anybody confirm that or give some more details on how to do it?
Moreover, NVIDIA runs an optimizer to select the fastest convolution algorithm, and sometimes it tests a few convolutions before making the final choice, see: https://arxiv.org/pdf/1602.08124.pdf
You can also control which convolution algorithm is used at the CUDA level: http://www.goldsborough.me/cuda/ml/cudnn/c++/2017/10/01/14-37-23-convolutions_with_cudnn/


Yes, PyTorch uses the default algorithm if you do nothing (and you can’t specify it directly).
If you set torch.backends.cudnn.deterministic = True, it will use the default deterministic algorithm.
And if you set torch.backends.cudnn.benchmark = True, it will try different algorithms and pick the best one.

To my mind, it’d be great if PyTorch supported a manual (and deterministic) selection of the convolution algorithm.

@Adam_Dziedzic I think there is an issue open for that feature :wink: I am pretty sure we would be happy to accept a PR adding this!

Edit: looking at it, it is actually issue number 88, quite an old one :stuck_out_tongue: https://github.com/pytorch/pytorch/issues/88

torch.backends.cudnn.deterministic = True

Using this for some reason gives me

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

for a long-running job, once the Conv2d reaches a certain size (50x80x80).

I’m running

torch==1.4.0
torchvision==0.5.0
Driver Version: 440.33.01    
CUDA Version: 10.2  

on a Tesla V100.

Could you post the conv setup as well as an input that would reproduce this error?
Also, I would recommend updating to the latest stable version (or the nightly), as it includes the latest bug fixes (besides new features) :wink:

It’s quite hard to give you an MWE because of how involved the code is. I have a conda env and script here.

I suspect it’s a memory leak, because if I make shorter runs (I’m iterating over a set of hyperparameters) there is no segfault. Is it possible to use something like valgrind to investigate this?

Also, btw @albanD, setting torch.backends.cudnn.deterministic = True did not actually fix the convolution strategy (i.e. force a single strategy); here is what I see:

[image: runtime plot]

Edit: @Adam_Dziedzic I also can’t see which CUDA conv strategy is being used after running nvprof. The only thing I can see is implicit_sgemm.

For cudnn.deterministic = True, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM will be used in the forward pass.

Given that the code includes multiple files, I would recommend first disabling the multiprocessing pool and removing everything unnecessary until you can narrow down the segmentation fault to a single model with some dummy data.
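As a starting point, a self-contained sketch along these lines might already reproduce it; the 50x80x80 activation size comes from the report above, while the batch size, channel counts and kernel size are placeholders:

```python
import torch
import torch.nn as nn

torch.backends.cudnn.deterministic = True

device = torch.device("cuda")
# Single conv with dummy data, no multiprocessing, no dataset.
conv = nn.Conv2d(50, 50, kernel_size=3, padding=1).to(device)
x = torch.randn(1, 50, 80, 80, device=device)

out = conv(x)           # forward
out.mean().backward()   # backward, in case the crash happens there
torch.cuda.synchronize()
print(out.shape)
```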