How does cudnn.deterministic actually work?

Recently we've been working on storing the cache used by the benchmark and deterministic modes.
We tested on an internal detection model whose input shapes vary a lot.

Testing Environment:

  • PyTorch 1.8.1 and CUDA 11.2 on a Tesla PG503-216.
  • all seeds are fixed to 0

We ran with (benchmark=False, deterministic=False) and with (benchmark=True, deterministic=False), and analyzed the saved cache, i.e. the (key, value) pairs stored in the BenchmarkCache struct.

During our tests, we observed:

  1. When we ran (benchmark=False, deterministic=False) twice, we got the same conv algorithm for the same ConvolutionParams; however, (benchmark=True, deterministic=False) gives different results across runs.

  2. It is not guaranteed that the algorithm chosen by cudnnFindConvolutionForwardAlgorithmEx performs better than the one chosen by cudnnGetConvolutionForwardAlgorithm_v7.
    Here we load the cache at the beginning, so the algorithm id is provided directly.

So we are wondering:

  1. How does cudnnGetConvolutionForwardAlgorithm_v7 actually work? What information does it use to pick the algorithm? We noticed that the perf struct's time field is -1 (e.g. in cudnnConvolutionFwdAlgoPerf_t).
  2. How is the time value inside the perf struct returned by cudnnFindConvolutionForwardAlgorithmEx actually measured?
  1. benchmark=True will profile kernels and select the fastest one, which might not be deterministic as it would depend on your system load. benchmark=False will use internal heuristics and try to select the fastest kernel given the setup (input shape, conv arguments, dtype etc.). deterministic=True will force cuDNN to select deterministic kernels (i.e. their computation is deterministic).

  2. The returned time is created internally in cudnnFind during the kernel profiling.
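
For reference, here is a minimal sketch of how the two query paths can be called (descriptor creation, buffer allocation and error checking are omitted; compareAlgoQueries and all variable names are illustrative, not PyTorch's actual code):

#include <cstdio>
#include <cudnn.h>

// Sketch only: compare what the heuristic query and the exhaustive search
// return for one convolution. Descriptors and device buffers are assumed to
// be created and filled by the caller.
void compareAlgoQueries(cudnnHandle_t handle,
                        cudnnTensorDescriptor_t xDesc, const void* x,
                        cudnnFilterDescriptor_t wDesc, const void* w,
                        cudnnConvolutionDescriptor_t convDesc,
                        cudnnTensorDescriptor_t yDesc, void* y,
                        void* workspace, size_t workspaceSize) {
  const int requested = CUDNN_CONVOLUTION_FWD_ALGO_COUNT;
  int returned = 0;
  cudnnConvolutionFwdAlgoPerf_t heur[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
  cudnnConvolutionFwdAlgoPerf_t found[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];

  // Heuristic path (used when benchmark=False): no kernels are launched,
  // so every entry comes back with time == -1.
  cudnnGetConvolutionForwardAlgorithm_v7(
      handle, xDesc, wDesc, convDesc, yDesc,
      requested, &returned, heur);
  for (int i = 0; i < returned; ++i) {
    printf("heuristic: algo=%d time=%.3f\n",
           static_cast<int>(heur[i].algo), heur[i].time);
  }

  // Exhaustive path (used when benchmark=True): each candidate algorithm is
  // launched on the real buffers and timed, so time holds a measured value.
  cudnnFindConvolutionForwardAlgorithmEx(
      handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
      requested, &returned, found,
      workspace, workspaceSize);
  for (int i = 0; i < returned; ++i) {
    printf("measured:  algo=%d time=%.3f ms\n",
           static_cast<int>(found[i].algo), found[i].time);
  }
}

Both result lists come back sorted with the preferred entry first; with benchmark=True the first usable entry is cached based on measured times, which can fluctuate between runs, while the heuristic ordering is stable, which is consistent with observation 1 above.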

Hi @ptrblck, thanks for your reply. During jiangzz's test we found that, in the detection model, the (benchmark=False, deterministic=False) setting behaves on par with benchmark=True, so we are wondering why it performs so well. Before that, we believed benchmark=True has more algorithms to choose from and should find the fastest one.

I don’t know what your exact use case is, but did you check which actual kernel is being executed?
Is the deterministic one the fastest or is benchmark=True selecting another one with the same execution speed?
Deterministic kernels are not forced to be slower by default, but can decrease the performance.
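
For example, one quick check is to log the perf entry that actually gets used (a sketch; fwdAlgPerf stands for whichever cudnnConvolutionFwdAlgoPerf_t entry ends up in the cache):

// Sketch: log which algorithm was picked and whether cuDNN reports it as
// deterministic. fwdAlgPerf is assumed to be the perf entry actually used.
// time is -1 for heuristic entries and a measured value for benchmarked ones.
printf("algo=%d deterministic=%s mathType=%d memory=%zu time=%.3f ms\n",
       static_cast<int>(fwdAlgPerf.algo),
       fwdAlgPerf.determinism == CUDNN_DETERMINISTIC ? "yes" : "no",
       static_cast<int>(fwdAlgPerf.mathType),
       fwdAlgPerf.memory,
       fwdAlgPerf.time);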

During the test, we compared the benchmark cache entries for the cuDNN algorithms selected by cudnnGetConvolutionForwardAlgorithm_v7 and by cudnnFindConvolutionForwardAlgorithmEx: some of them differ and some are the same.
Among the differing choices, some algorithms selected by cudnnFindConvolutionForwardAlgorithmEx are not faster.
We did the timing test like this:

struct TimeMeasure {
  float _t = 0.f;            // elapsed time of the timed region, in milliseconds
  cudaEvent_t start, stop;
  TimeMeasure() {
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
  }
  ~TimeMeasure() {
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
  }
};

TimeMeasure tm;
if (flag) {
  cudaEventRecord(tm.start);       // mark the point right before the conv launch
}
AT_CUDNN_CHECK_WITH_SHAPES(cudnnConvolutionForward(
    args.handle,
    &one, args.idesc.desc(), input.data_ptr(),
    args.wdesc.desc(), weight.data_ptr(),
    args.cdesc.desc(), fwdAlgPerf.algo, workspace.data_ptr(), fwdAlgPerf.memory,
    &zero, args.odesc.desc(), output.data_ptr()),
  args, "Forward algorithm: ", static_cast<int>(fwdAlgPerf.algo), "\n");
if (flag) {
  cudaEventRecord(tm.stop);
  cudaEventSynchronize(tm.stop);   // wait until the conv kernel has finished
  float time;
  cudaEventElapsedTime(&time, tm.start, tm.stop);
  tm._t = time;                    // store the measured time in milliseconds
}
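
Note: a single launch timed with events like this can be fairly noisy; a common pattern is to warm up once and average over several launches, e.g. (run_conv_once() is a hypothetical wrapper around the cudnnConvolutionForward call above):

// Sketch: warm up once, then average the event-timed duration over kIters
// launches to reduce the impact of one-off setup cost and clock noise.
constexpr int kIters = 50;
run_conv_once();                       // warm-up, not timed
float total_ms = 0.f;
for (int i = 0; i < kIters; ++i) {
  cudaEventRecord(tm.start);
  run_conv_once();
  cudaEventRecord(tm.stop);
  cudaEventSynchronize(tm.stop);
  float ms = 0.f;
  cudaEventElapsedTime(&ms, tm.start, tm.stop);
  total_ms += ms;
}
tm._t = total_ms / kIters;             // average time per launch in milliseconds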