Why is the embedding_bag result different between the GPU and CPU implementations?

When indices and offsets are both 2D, the CPU and GPU results differ:

CPU:

GPU:

Could you post a minimal, executable code snippet, wrapped in three backticks ```, so we can try to reproduce and debug the issue?
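
For reference, a repro along these lines would be enough to start from. This is only a sketch: the table size, seed, and index values are placeholders made up for illustration, not the data from the original post. It uses `torch.nn.functional.embedding_bag`, which, as documented, treats a 2D input as fixed-length bags and requires `offsets` to be `None` in that case; if the original call passed 2D offsets through a different entry point, the snippet would need to be adapted.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical embedding table and indices, chosen only for illustration.
weight = torch.randn(10, 3)
indices = torch.tensor([[1, 2, 4, 5],
                        [4, 3, 2, 9]])  # 2D input: two bags of length 4

# With a 2D input, offsets must be left as None (the default).
cpu_out = F.embedding_bag(indices, weight, mode="sum")
print("CPU:", cpu_out)

# Run the same call on the GPU, if one is available, and compare.
if torch.cuda.is_available():
    gpu_out = F.embedding_bag(indices.cuda(), weight.cuda(), mode="sum")
    print("GPU:", gpu_out)
    print("match:", torch.allclose(cpu_out, gpu_out.cpu(), atol=1e-6))
```

Posting the printed CPU and GPU tensors together with the PyTorch version would make the mismatch much easier to track down.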