What does aten::fill_ mean in the ATen APIs?

I profiled a model using torch.profiler and was confused by the aten::fill_ API. Since there is no specific documentation for ATen, I couldn't make sense of the original code at the following URL.

What's the difference between aten::fill_ on GPU and on CPU? When running the same model, aten::fill_ seems to cost much more time on CPU than on GPU.

If anyone is familiar with ATen and its APIs, could you explain how this operator works?
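For context, here is a minimal sketch of how an aten::fill_ entry can show up in a torch.profiler trace. The tensor shape and the use of a plain in-place `fill_` call are my own illustration, not taken from the original model:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# An in-place fill is dispatched to the ATen operator aten::fill_,
# so profiling this single call should surface it in the trace.
x = torch.empty(1000, 1000)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    x.fill_(1.0)

op_names = [evt.key for evt in prof.key_averages()]
print("aten::fill_" in op_names)
```

Running this and inspecting `prof.key_averages().table()` shows the per-operator CPU time, which is the same view the question is describing.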

fill_ looks to be a typical TensorIterator op (roughly summarized, TensorIterator is a highly templated way to implement elementwise operations): pytorch/Fill.cpp at a9b0a921d592b328e7e80a436ef065dadda5f01b · pytorch/pytorch · GitHub. TensorIterator ops are usually very generic, so there likely isn't anything about fill_ that stands out vs. other elementwise ops.
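At the Python level, the behavior of this op is simple: the trailing underscore marks it as in-place, and it writes the same scalar into every element. A small sketch of those semantics:

```python
import torch

x = torch.empty(2, 3)
y = x.fill_(7.0)  # in-place: mutates x and returns the same tensor

# y is not a copy; it aliases x.
assert y is x
# Every element now holds the scalar, equivalent to torch.full.
assert torch.equal(x, torch.full((2, 3), 7.0))
```

This is why it behaves like any other elementwise op for profiling purposes: the cost is essentially one write per element, bounded by memory bandwidth.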

Can you describe how you are timing the relative cost of fill_ on CPU and GPU? The first invocation of fill_ may be slow depending on when timing starts and how the CUDA context is created, but I would be surprised if it remained slower when filling large tensors, given the typical difference between GPU and CPU memory bandwidth.
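To make the CPU/GPU comparison fair, the measurement needs a warm-up call (to absorb CUDA context creation) and a synchronize before and after the timed region, since CUDA kernels launch asynchronously. A minimal sketch, with sizes and iteration counts chosen arbitrarily for illustration:

```python
import time
import torch

def time_fill(device, size=4096, iters=100):
    """Average seconds per fill_ call on the given device."""
    x = torch.empty(size, size, device=device)
    x.fill_(0.0)  # warm-up: first call may pay one-time setup costs
    if device == "cuda":
        torch.cuda.synchronize()  # kernels are async; drain before timing
    start = time.perf_counter()
    for _ in range(iters):
        x.fill_(1.0)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / iters

print(f"CPU : {time_fill('cpu'):.6f} s per fill_")
if torch.cuda.is_available():
    print(f"CUDA: {time_fill('cuda'):.6f} s per fill_")
```

Without the synchronize calls, the GPU timing would measure only kernel launch overhead, which can make either device look misleadingly fast or slow.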