Slowdown in CUDA graph execution using cudaMemsetAsync (vs. a fill kernel)

This is an interesting observation, as this topic claims the opposite, at least for eager execution.

Could you post your full extension code to reproduce the issue?