What happens at 33 filters in Conv2d?

You can disregard the CPU traces.

This is a plot of runtime vs. number of filters (n_bins) and kernel width (max_sigma) in a Conv2d layer. What happens at 33 filters? Obviously a different convolution strategy is being chosen for some optimization reason, but does anyone know exactly what the switch is and where it happens in the code?
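For reference, here is a minimal sketch (not the original benchmark code) of how such a timing sweep could look; the input shape, kernel size and filter range are illustrative only:

```python
import time
import torch
import torch.nn as nn

device = torch.device("cuda")
x = torch.randn(1, 1, 256, 256, device=device)

for n_filters in range(30, 37):
    conv = nn.Conv2d(1, n_filters, kernel_size=11, padding=5).to(device)
    # Warm-up so one-time setup costs are not measured.
    for _ in range(3):
        conv(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        conv(x)
    torch.cuda.synchronize()
    print(n_filters, (time.time() - start) / 100)
```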


Hi,

I would say this is due to cudnn internals.

  • If you disable cudnn, only our native implementation will be used and you should not see this jump
  • cudnn has its own custom algorithms, chosen depending on the input size
  • You can enable cudnn benchmark mode (torch.backends.cudnn.benchmark = True) and check whether the behavior improves; it should pick the best algorithm for your input sizes and so remove such artifacts (see the sketch after this list)
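A minimal sketch of these switches; the flag names are the actual torch.backends.cudnn attributes, the rest is illustrative:

```python
import torch

# Disable cudnn entirely: convolutions fall back to PyTorch's native kernels,
# so a cudnn-internal algorithm switch should no longer show up.
torch.backends.cudnn.enabled = False

# Or keep cudnn enabled and let it benchmark the available algorithms for the
# current input shapes, caching the fastest one per shape.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
```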

I would say this is due to cudnn internals.

I’m guessing there’s no way to find out what cudnn is doing because it’s closed source? Or is there a way to get nvprof (or something like that) to tell me which conv strategy it’s using?

I think there is a verbose mode for cudnn that gives you more info. @ptrblck ?

Hopefully benchmark mode will remove these hiccups that we can’t control :slight_smile:

Thanks @albanD. I’m actually not that worried about this; I’m just interested to know which choices are made (and maybe why they’re made).

Yes, you can use nvprof to capture the cudnn calls and check which kernel is called for your current workload.


Yes, you can use nvprof to find out which convolution algorithm was used.
On your GPU machine: nvprof -o prof.nvvp python train_mnist.py
Then copy prof.nvvp to your local machine and open it with: nvvp prof.nvvp
More details can be found here: https://gist.github.com/sonots/5abc0bccec2010ac69ff74788b265086
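One way to make the convolution easier to spot in the nvprof/nvvp timeline is to wrap the forward pass in NVTX ranges via torch.autograd.profiler.emit_nvtx(). This is only a sketch with placeholder shapes, meant to be run under the nvprof command above:

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
conv = nn.Conv2d(1, 33, kernel_size=11, padding=5).to(device)
x = torch.randn(1, 1, 256, 256, device=device)

# Emit NVTX markers so the conv shows up as a named range in the timeline,
# right next to the cudnn kernel that was actually launched for it.
with torch.autograd.profiler.emit_nvtx():
    out = conv(x)
torch.cuda.synchronize()
```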

Indeed, it seems like a change in the convolution algorithm. By the way, as far as I know you cannot select or check which convolution algorithm is used in PyTorch. Could anybody confirm that or give some more details on how to do it?
Moreover, NVIDIA runs an optimizer to select the fastest convolution algorithm, and sometimes it tests a few convolutions before making the final choice, see: https://arxiv.org/pdf/1602.08124.pdf
You can also control which convolution algorithm is used at the CUDA level: http://www.goldsborough.me/cuda/ml/cudnn/c++/2017/10/01/14-37-23-convolutions_with_cudnn/


Yes, PyTorch uses the default algorithm if you do nothing (and you can’t specify it directly).
If you set torch.backends.cudnn.deterministic = True, it will use the default deterministic algorithm.
And if you set torch.backends.cudnn.benchmark = True, it will try different algorithms and pick the best one.

To my mind, it’d be great if PyTorch supported a manual (and deterministic) selection of the convolution algorithm.

@Adam_Dziedzic I think there is an issue open for that feature :wink: I am pretty sure we would be happy to accept a PR adding this!

Edit: looking at it, it is actually issue number 88, quite an old one :stuck_out_tongue: https://github.com/pytorch/pytorch/issues/88

torch.backends.cudnn.deterministic = True

Using this for some reason gives me

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

for a long-running job, once the Conv2d reaches a certain size (50x80x80).

I’m running

torch==1.4.0
torchvision==0.5.0
Driver Version: 440.33.01    
CUDA Version: 10.2  

on a Tesla V100.

Could you post the conv setup as well as an input that would reproduce this error?
Also, I would recommend updating to the latest stable version (or the nightly), as it includes the latest bug fixes (besides new features) :wink:

It’s quite hard to give you an MWE because of how involved the code is. I have a conda env and script here.

I suspect it’s a memory leak, because if I make shorter runs (I’m iterating over a set of hyperparameters) there is no segfault. Is it possible to use something like valgrind to investigate this?

Also, btw @albanD, setting torch.backends.cudnn.deterministic = True did not actually fix the convolution strategy (i.e. force a single strategy); here is what I see:

[image: runtime plot]

Edit: @Adam_Dziedzic I also can’t see which CUDA conv strategy is being used after running nvprof. The only thing I can see is implicit_sgemm.

For cudnn.deterministic = True, CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM will be used in the forward pass.

Given that the code includes multiple files, I would recommend first disabling the multiprocessing pool and removing everything unnecessary until you can narrow down the segmentation fault to a single model with some dummy data.
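As a starting point, a self-contained sketch along these lines might already reproduce it; the 50x80x80 activation size comes from the report above, while the batch size, channel counts and kernel size are placeholders:

```python
import torch
import torch.nn as nn

torch.backends.cudnn.deterministic = True

device = torch.device("cuda")
# Single conv with dummy data, no multiprocessing, no dataset.
conv = nn.Conv2d(50, 50, kernel_size=3, padding=1).to(device)
x = torch.randn(1, 50, 80, 80, device=device)

out = conv(x)           # forward
out.mean().backward()   # backward, in case the crash happens there
torch.cuda.synchronize()
print(out.shape)
```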