To measure the performance of a custom C++ function, should I use the RECORD_FUNCTION macro, or follow the steps in Dispatcher::callWithDispatchKeySlowPath?
- It looks better to call at::shouldRunRecordFunction(&pre_sampled) first and only add a RecordFunction to the execution path when it returns true, which reduces overhead when profiling is off. Is there a reason at::shouldRunRecordFunction(&pre_sampled) is not invoked inside the macro?
- The pre_sampled parameter does not seem to have any effect in RECORD_FUNCTION. Does pre_sampled bring any performance benefit?
- How does the RECORD_FUNCTION macro handle nested op invocations?
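
To make the questions concrete, here is a rough sketch of the "check first, then construct" pattern from the first bullet, plus the nesting case from the last one. It is not PyTorch code: `ProfileScope`, `should_profile`, `inner_op`, and `outer_op` are hypothetical stand-ins for RecordFunction, at::shouldRunRecordFunction, and two custom ops, and the sketch assumes C++17 for `std::optional`. It only illustrates the control flow being asked about, not how the real macro or dispatcher behaves.

```cpp
#include <atomic>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins only; none of these names are real PyTorch APIs.
namespace sketch {

std::atomic<bool> profiling_enabled{false};  // plays the role of "is a profiler active?"
int scopes_constructed = 0;                  // counts how many scope objects were built
std::vector<std::string> events;             // recorded enter/exit events, in order

// RAII scope: records an event on construction and on destruction.
struct ProfileScope {
  std::string name;
  explicit ProfileScope(std::string n) : name(std::move(n)) {
    ++scopes_constructed;
    events.push_back("enter:" + name);
  }
  ~ProfileScope() { events.push_back("exit:" + name); }
};

// Cheap check performed before any scope object is constructed.
bool should_profile() {
  return profiling_enabled.load(std::memory_order_relaxed);
}

int inner_op(int x) {
  std::optional<ProfileScope> guard;
  if (should_profile()) {
    guard.emplace("inner_op");  // construct the scope only when profiling is on
  }
  return x + 1;
}

int outer_op(int x) {
  std::optional<ProfileScope> guard;
  if (should_profile()) {
    guard.emplace("outer_op");
  }
  // Nested call: the inner scope opens and closes inside the outer one.
  return inner_op(x) * 2;
}

}  // namespace sketch
```

With `profiling_enabled` left false, calling `sketch::outer_op` constructs no scope objects at all; with it set to true, the inner op's enter/exit events land between the outer op's. Whether the real RECORD_FUNCTION gives the same guarantees is exactly what the questions above are asking.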