I profiled a model with torch.profiler, only to find myself confused by the aten::fill_ entry. Since there is no dedicated documentation for ATen, I couldn't make sense of the original code at the following URL.
What's the difference between aten::fill_ on GPU and on CPU? It seems that aten::fill_ takes much more time when running the same model on CPU compared to GPU.
If anyone is familiar with ATen and its APIs, could you please explain how they work?
Can you describe how you are measuring the relative time cost of fill_ on CPU and GPU? The first invocation of fill_ is expected to be slow, depending on when timing starts and how the CUDA context is created, but I would be surprised if it remained slower when filling large tensors, given the typical difference between GPU and CPU memory bandwidth.
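To make the measurement fair, you can profile fill_ in isolation with a warm-up pass so one-time setup cost (CUDA context creation, allocator warm-up) is not attributed to the op itself. Here is a minimal sketch; the tensor shape and iteration count are arbitrary choices for illustration, and the CUDA branch only runs if a GPU is available:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in for the model: a tensor we fill explicitly so that
# aten::fill_ shows up in the profiler output.
x = torch.empty(1000, 1000)

# Warm-up call: any one-time setup cost lands here, not in the
# profiled region below.
x.fill_(1.0)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    x = x.cuda()          # triggers CUDA context creation before profiling
    activities.append(ProfilerActivity.CUDA)
    x.fill_(1.0)          # GPU warm-up call
    torch.cuda.synchronize()

with profile(activities=activities) as prof:
    for _ in range(10):
        x.fill_(0.0)
    if torch.cuda.is_available():
        # Without a synchronize, asynchronous GPU kernels may not be
        # fully accounted for when the profiling region ends.
        torch.cuda.synchronize()

# Averaging over 10 calls smooths out per-call noise; sort by total
# time to see where fill_ ranks.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

Comparing the averaged aten::fill_ row from a CPU run and a GPU run of this snippet should give a more reliable picture than a single timed invocation.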