index_put_ really slow on GPU

I am trying to understand why my code runs 3x slower on the GPU. I'm profiling and debugging my update-state function, which should be straightforward.
Everything should be on the GPU, so I don't understand why writing these values takes so long. However, the profile shows index_put_ running from libtorch_cpu.so, which suggests that most of the function is actually executed on the CPU, even though all of the values are on the GPU.
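One caveat when timing this: CUDA kernels launch asynchronously, so wall-clock timings around individual calls are only meaningful if the device is synchronized first. A minimal sketch of how the region can be bracketed, using torch::cuda::synchronize() from <torch/cuda.h>:

    #include <torch/cuda.h>
    #include <chrono>

    // Flush queued GPU work before and after the region so the measured
    // interval covers the kernels themselves, not just their launches.
    torch::cuda::synchronize();
    auto t0 = std::chrono::steady_clock::now();
    update_state(dt);
    torch::cuda::synchronize();
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();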

I'm new to using torch in C++, so please help me understand.

    void AAGV::update_state(double dt) {
        auto deltaTime = torch::scalar_tensor(dt, TOptions(torch::kDouble, device));
        torch::Tensor orientation = getOrientation();

        // TODO: Very slow on GPU, OPTIMIZE THIS
        _B.index_put_({0, 0}, torch::cos(orientation) * deltaTime);
        _B.index_put_({1, 0}, torch::sin(orientation) * deltaTime);
        _B.index_put_({2, 1}, deltaTime);

        // Attempt 1: build B from scalars with torch::tensor; this needs the
        // orientation value on the host, since torch::tensor takes doubles.
        // double theta = orientation.item<double>();
        // torch::Tensor B = torch::tensor({{std::cos(theta) * dt, 0.},
        //                                  {std::sin(theta) * dt, 0.},
        //                                  {0.,                   dt}},
        //                                 TOptions(torch::kDouble, device));

        // Attempt 2: fill a host buffer and copy it to the device in one transfer.
        // double data[] = {std::cos(theta) * dt, 0., std::sin(theta) * dt, 0., 0., dt};
        // torch::Tensor B = torch::from_blob(data, {3, 2}, torch::kDouble).clone().to(device);


        torch::Tensor vel = torch::stack({getLinearVelocity(), getAngularVelocity()});

        torch::Tensor new_state = A.matmul(getState()) + _B.matmul(vel);
        setState(new_state);
    }
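For reference, B can also be assembled on the device without any index_put_ calls by concatenating the intermediate tensors. This is only a sketch under my assumptions (orientation is a one-element tensor on the device, deltaTime is the 0-dim tensor from the function), and it still launches several small kernels, so it is not guaranteed to be faster:

    // Sketch: build B via cat/stack instead of per-element index_put_.
    torch::Tensor c = torch::cos(orientation) * deltaTime;    // shape {1}
    torch::Tensor s = torch::sin(orientation) * deltaTime;    // shape {1}
    torch::Tensor zero = torch::zeros_like(c);                // shape {1}
    torch::Tensor B = torch::stack({
        torch::cat({c, zero}),                      // row 0: [cos(theta)*dt, 0]
        torch::cat({s, zero}),                      // row 1: [sin(theta)*dt, 0]
        torch::cat({zero, deltaTime.reshape({1})})  // row 2: [0,             dt]
    });                                             // result: {3, 2} on `device`

For a 3x2 matrix it may also simply be cheaper to keep this part of the update on the CPU and only move larger tensors to the GPU; per-call launch overhead tends to dominate at these sizes.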

Could you describe how getOrientation is defined and how you’ve verified orientation is on the GPU?

Yes, getOrientation is just a getter indexing into a state tensor.
I only have this function to make my code more readable.

    torch::Tensor getOrientation() {
        return state.index({2});
    }

    torch::Tensor state = torch::zeros({3, 1}, torch::TensorOptions().dtype(torch::kDouble).device(device));

And I verified it by debugging and printing out the device:

    Orientation:
    0.01 *
    -8.0727
    [ CUDADoubleType{1} ]
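For completeness, the device can also be checked explicitly instead of being inferred from the printed type; a small sketch (orientation and _B as in the code above):

    // Sketch: assert that the tensors live where we expect them to.
    std::cout << "orientation: " << orientation.device() << std::endl;  // e.g. cuda:0
    TORCH_CHECK(orientation.is_cuda(), "orientation is not on the GPU");
    TORCH_CHECK(_B.is_cuda(), "_B is not on the GPU");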

Could it be because my indexing is not on the GPU?
That is, the {0, 0} part of _B.index_put_({0, 0}, torch::cos(orientation) * deltaTime);

@ptrblck Same issue here. In my case, index_put_ on CUDA is much slower than on the CPU when calculating a confusion matrix.

Here is the code snippet:

    def add_batch(self, x, y):
        # Flatten predictions and labels into 1-D vectors of class ids.
        x_row = x.reshape(-1)
        y_row = y.reshape(-1)

        # (2, N) tensor of (row, col) coordinates into the confusion matrix.
        idxs = torch.stack([x_row, y_row], dim=0)

        # Cache the ones vector; re-allocate only when the batch size changes.
        if self.ones is None or self.last_scan_size != idxs.shape[-1]:
            self.ones = torch.ones(idxs.shape[-1], device=x.device, dtype=torch.long)
            self.last_scan_size = idxs.shape[-1]

        # Scatter-add one count per (row, col) pair; index_put_ works in place.
        self.conf_matrix.index_put_(tuple(idxs), self.ones, accumulate=True)

With the CPU version it takes around 5 minutes [02:14<03:05, 18.62it/s], while on CUDA it becomes roughly 12x slower [00:16<1:05:09, 1.53it/s].

Since transferring the tensor from CUDA to the CPU adds extra cost, any idea how to solve this problem would be very helpful for my project.

Thanks.
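One workaround for this accumulation pattern is to linearize the (row, col) pairs and count them with torch.bincount instead of index_put_(..., accumulate=True). A minimal sketch, assuming a square n_classes x n_classes matrix (n_classes and the function name are placeholders, not taken from the snippet above):

    import torch

    def add_batch_bincount(conf_matrix, x, y, n_classes):
        # Flatten predictions and labels into 1-D vectors of class ids.
        x_row = x.reshape(-1).long()
        y_row = y.reshape(-1).long()

        # Linearize each (row, col) pair into a single flat index.
        flat = x_row * n_classes + y_row

        # One histogram call counts all pairs on the current device.
        counts = torch.bincount(flat, minlength=n_classes * n_classes)

        # Fold the flat counts back into the 2-D confusion matrix.
        conf_matrix += counts.reshape(n_classes, n_classes)
        return conf_matrix

This keeps everything on the same device and replaces the scatter-add with a single bincount call; whether it ends up faster will depend on the tensor sizes and the GPU, so it is worth profiling both variants.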