What does aten::fill_ mean in the ATen APIs?

I profiled a model using torch.profiler and was confused by the aten::fill_ API. Since there is no specific documentation for ATen, I couldn't make sense of the original code at the following URL.

What's the difference between aten::fill_ on GPU and on CPU? When running the same model, aten::fill_ seems to cost much more time on CPU than on GPU.

If anyone is familiar with ATen and its APIs, could you explain how this operator works?
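For context, here is a minimal sketch of how an aten::fill_ entry can show up in a torch.profiler trace. The tensor shape and the use of a plain in-place `fill_` call are my own illustration, not taken from the original model:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# An in-place fill is dispatched to the ATen operator aten::fill_,
# so profiling this single call should surface it in the trace.
x = torch.empty(1000, 1000)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    x.fill_(1.0)

op_names = [evt.key for evt in prof.key_averages()]
print("aten::fill_" in op_names)
```

Running this and inspecting `prof.key_averages().table()` shows the per-operator CPU time, which is the same view the question is describing.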

fill_ looks to be a typical TensorIterator op (roughly summarized, TensorIterator is a highly templated way to implement elementwise operations): pytorch/Fill.cpp at a9b0a921d592b328e7e80a436ef065dadda5f01b · pytorch/pytorch · GitHub. TensorIterator ops are usually very generic, so there likely isn't anything about fill_ that stands out vs. other elementwise ops.
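At the Python level, the behavior of this op is simple: the trailing underscore marks it as in-place, and it writes the same scalar into every element. A small sketch of those semantics:

```python
import torch

x = torch.empty(2, 3)
y = x.fill_(7.0)  # in-place: mutates x and returns the same tensor

# y is not a copy; it aliases x.
assert y is x
# Every element now holds the scalar, equivalent to torch.full.
assert torch.equal(x, torch.full((2, 3), 7.0))
```

This is why it behaves like any other elementwise op for profiling purposes: the cost is essentially one write per element, bounded by memory bandwidth.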

Can you describe how you are timing the relative cost of fill_ on CPU and GPU? The first invocation of fill_ may be slow depending on when timing starts and how the CUDA context is created, but I would be surprised if it remained slower when filling large tensors, given the typical difference between GPU and CPU memory bandwidth.
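To make the CPU/GPU comparison fair, the measurement needs a warm-up call (to absorb CUDA context creation) and a synchronize before and after the timed region, since CUDA kernels launch asynchronously. A minimal sketch, with sizes and iteration counts chosen arbitrarily for illustration:

```python
import time
import torch

def time_fill(device, size=4096, iters=100):
    """Average seconds per fill_ call on the given device."""
    x = torch.empty(size, size, device=device)
    x.fill_(0.0)  # warm-up: first call may pay one-time setup costs
    if device == "cuda":
        torch.cuda.synchronize()  # kernels are async; drain before timing
    start = time.perf_counter()
    for _ in range(iters):
        x.fill_(1.0)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / iters

print(f"CPU : {time_fill('cpu'):.6f} s per fill_")
if torch.cuda.is_available():
    print(f"CUDA: {time_fill('cuda'):.6f} s per fill_")
```

Without the synchronize calls, the GPU timing would measure only kernel launch overhead, which can make either device look misleadingly fast or slow.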