Performance on AMD GPUs

Hi,
I have collected performance data on MI250X (single GCD) and MI300 AMD GPUs. I see a significant slow down in the following kernels compared to MI250X. I am not at all familiar with the PyTorch source. I would like some help understanding the source (i.e. how the specific kernel is launched), so I can better understand the performance issue.

  1. void at::native::_scatter_gather_elementwise_kernel<256, 4, at::native::_cuda_scatter_gather_internal_kernel<true, c10::Half>::operator()at::native::ReduceAdd(at::TensorIterator&, long, long, long, at::native::ReduceAdd const&)::{lambda(int)#1}>(int, at::native::_cuda_scatter_gather_internal_kernel<true, c10::Half>::operator()at::native::ReduceAdd(at::TensorIterator&, long, long, long, at::native::ReduceAdd const&)::{lambda(int)#1})

  2. void at::native::indexFuncLargeIndex<c10::Half, long, unsigned int, 2, 2, -2, true, at::native::(anonymous namespace)::ReduceAdd>(at::cuda::detail::TensorInfo<c10::Half, unsigned int>, at::cuda::detail::TensorInfo<c10::Half, unsigned int>, at::cuda::detail::TensorInfo<long, unsigned int>, int, int, unsigned int, unsigned int, long, long, at::native::(anonymous namespace)::ReduceAdd const&, c10::Half)
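For context, both kernels implement a scatter/index reduction with a `ReduceAdd` functor over `c10::Half` data. In plain Python, the semantics are roughly the following (a sketch of what the operation computes, not the actual kernel code; on the GPU, many threads can target the same output element, which is why the kernels use atomic adds, and 16-bit atomics can be especially costly when the hardware falls back to a compare-and-swap loop):

```python
def scatter_add(out, index, src):
    """Sketch of 1-D scatter-add semantics: out[index[i]] += src[i].
    Indices may repeat, so a parallel implementation has write
    collisions on the output, which the CUDA/HIP kernels resolve
    with atomicAdd."""
    result = list(out)
    for i, idx in enumerate(index):
        result[idx] += src[i]
    return result

# Indices 1 and 1 collide: this is the contended case that
# forces atomics in a parallel implementation.
print(scatter_add([0, 0, 0, 0], [1, 3, 1], [10, 20, 30]))  # [0, 40, 0, 20]
```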

I observe the issue with the 2.2 release of the ROCm branch of PyTorch.

I found the source code for the scatter/gather kernel. My question, for those familiar with it: does this operation absolutely require atomics? Are there alternative implementations (prefix-scan based, perhaps) that could be used instead?
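To make the question concrete, one atomic-free alternative I have in mind is a sort-then-segmented-reduce: sort the (index, value) pairs by index, then reduce each contiguous run of equal indices. A minimal pure-Python sketch of that idea follows (the names and structure here are my own illustration; a real GPU version would use a device radix sort plus a prefix-scan-style segmented reduction, e.g. the primitives in CUB/hipCUB):

```python
def scatter_add_sorted(out_len, index, src):
    """Atomic-free scatter-add sketch: sort pairs by index, then do a
    segmented reduction over runs of equal indices. Deterministic and
    contention-free, but it pays for the sort. Assumes a zero-initialized
    output for simplicity."""
    pairs = sorted(zip(index, src))  # stand-in for a device radix sort
    out = [0] * out_len
    i = 0
    while i < len(pairs):
        j = i
        acc = 0
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            acc += pairs[j][1]       # segment-local reduction
            j += 1
        out[pairs[i][0]] = acc       # one uncontended write per segment
        i = j
    return out

print(scatter_add_sorted(4, [1, 3, 1], [10, 20, 30]))  # [0, 40, 0, 20]
```

The trade-off is that the sort cost is paid even when indices rarely collide, whereas the atomic approach only degrades under contention.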