cuDNN exception during engine config selection

Note: My understanding of cuDNN internals is limited, so if you realize that my question simply stems from a misunderstanding, please point it out so I can learn.

In aten/src/ATen/native/cudnn/Conv_v8.cpp (https://github.com/pytorch/pytorch/blob/19e93b85b91dba5c2d04e97f82ec764f404a4c10/aten/src/ATen/native/cudnn/Conv_v8.cpp#L564) the try_configs function tries to build a cudnn_frontend execution plan, catching exceptions when a given config fails. Reading the code around that function and the cuDNN docs, these exceptions are of no concern as long as one engine can be created successfully.

These exceptions, however, always show up in my output window in Visual Studio, and my debugger always breaks on them. I am also wondering about the design choice in general, since there seems to be a cuDNN-internal function that does the same job but presumably doesn't throw exceptions. From the docs:

You can “auto-tune”, that is, iterate over the list and time for each engine config and choose the best one for a particular problem on a particular device. The cuDNN frontend API provides a convenient function, cudnnFindPlan(), which does this.

Is there

  • a) A reason why choosing the first working config from the list was implemented in this form?
  • b) A way to hide this exception from showing up in Visual Studio's output window? I like to keep it clean so I can pinpoint actual errors more quickly.

Could you show what your Visual Studio output prints for these exceptions, please?
CC @eqy

I get

Exception thrown at 0x00007FFA6AC3CF19 in foo.exe: Microsoft C++ exception: cudnn_frontend::cudnnException at memory location 0x00000049E82EE7D0.

It is not very descriptive in itself. With the debugger I could see that it is thrown from set_error_and_throw_exception in pytorch\third_party\cudnn_frontend\include\cudnn_frontend_utils.h, which is itself called from pytorch\third_party\cudnn_frontend\include\cudnn_frontend_ExecutionPlan.h in these lines:

if (status != CUDNN_STATUS_SUCCESS) {
    set_error_and_throw_exception(
        &m_execution_plan, status, "CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed");
    return std::move(m_execution_plan);
}

In pytorch, the calling function is try_configs in pytorch\aten\src\ATen\native\cudnn\Conv_v8.cpp which catches and ignores the exception:

bool try_configs(
    cudnn_frontend::EngineConfigList& configs,
    const std::string& opgraph_tag,
    const CacheKey& key,
    const cudnnHandle_t handle,
    const Tensor& x,
    const Tensor& y,
    const Tensor& w) {
  for (auto& config : configs) {
    try {
      auto plan = cudnn_frontend::ExecutionPlanBuilder()
                      .setHandle(handle)
                      .setEngineConfig(config, opgraph_tag)
                      .build();
      if (plan_errata_exception(handle, plan.getTag())) {
        continue;
      }
      run_conv_plan(handle, x, y, w, plan);
      benchmark_cache.emplace(key, plan);
      return true;
    } catch (cudnn_frontend::cudnnException& e) {
    } catch (CuDNNError& e) {
    } catch (c10::OutOfMemoryError& e) {
      cudaGetLastError(); // clear CUDA error
    }
  }
  return false;
}

Sorry for the delayed response. For a), my understanding is that this is a historical limitation of cuDNN: the guarantee that a given EngineConfig actually works after being returned by the heuristics is not very strong, so this exception handling until the first working config is found is just a CYA solution (which unfortunately still seems to be needed, based on your experience).