index_put_ is really slow on GPU

I am trying to understand why my code is 3x slower on the GPU. I'm profiling and debugging my update-state function, which should be straightforward.
Everything should be on the GPU, so I do not understand why writing these values takes so long. However, the profiling output for index_put_ suggests that most of the function is actually executed on the CPU, even though all the values are on the GPU.

I'm new to using torch in C++, so please help me understand.

    void AAGV::update_state(double dt) {
        auto deltaTime = torch::scalar_tensor(dt, TOptions(torch::kDouble, device));
        torch::Tensor orientation = getOrientation();

        // TODO:: Very slow on gpu, OPTIMIZE THIS
        _B.index_put_({0, 0}, torch::cos(orientation) * deltaTime);
        _B.index_put_({1, 0}, torch::sin(orientation) * deltaTime);
        _B.index_put_({2, 1}, deltaTime);

//        torch::Tensor B = torch::tensor({{torch::cos(orientation) * deltaTime, 0.},
//                                         {torch::sin(orientation) * deltaTime, 0.},
//                                         {0.,                                  deltaTime}},
//                                        TOptions(torch::kDouble, device));

//        double data[] = {torch::cos(orientation).*dt, 0., torch::sin(orientation) * dt, 0., 0., dt};
//        torch::Tensor B = torch::from_blob(data, {3, 2}, torch::kDouble).to(device);

        torch::Tensor vel = torch::stack({getLinearVelocity(), getAngularVelocity()});

        torch::Tensor new_state = A.matmul(getState()) + _B.matmul(vel);

Could you describe how getOrientation is defined and how you’ve verified orientation is on the GPU?

Yes, getOrientation is just a getter indexing into a state tensor.
I only have this function to make my code more readable.

    torch::Tensor getOrientation() {
        return state.index({2});
    }

    torch::Tensor state = torch::zeros({3, 1}, torch::TensorOptions().dtype(torch::kDouble).device(device));

I verified it by debugging and printing out the tensor, which shows its device:

    0.01 *
    [ CUDADoubleType{1} ]

Could it be because my indices are not on the GPU?
I mean the {0, 0} part of _B.index_put_({0, 0}, torch::cos(orientation) * deltaTime);
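
On the {0, 0} question: as far as I understand, those indices are plain host-side integers that select one element, so they are not tensors and have no device of their own. Each indexed write still launches at least one small kernel, though, and for a 3x2 matrix the launch overhead can easily outweigh the actual work, which may explain why the GPU version looks slower. Below is a minimal sketch of one thing that could be tried, written in Python for brevity (torch::cat, torch::cos and torch::sin have the same names in libtorch): fill the first column with one assignment instead of two. The helper name update_B is hypothetical, and whether this is actually faster has to be measured.

```python
import torch


def update_B(B: torch.Tensor, orientation: torch.Tensor, dt: float) -> torch.Tensor:
    """Refresh the orientation-dependent entries of a 3x2 matrix in place.

    Hypothetical helper: B is assumed to be a 3x2 double tensor whose other
    entries are already zero, and orientation a 1-element tensor on the
    same device as B.
    """
    theta = orientation.reshape(1)
    # Compute cos and sin, then write both rows of column 0 in one assignment.
    column = torch.cat([torch.cos(theta), torch.sin(theta)]) * dt
    B[:2, 0] = column
    # dt is a Python float here, so this is a scalar fill of one element.
    B[2, 1] = dt
    return B


device = "cuda" if torch.cuda.is_available() else "cpu"
B = torch.zeros(3, 2, dtype=torch.double, device=device)
orientation = torch.zeros(1, dtype=torch.double, device=device)
update_B(B, orientation, 0.01)
```

If dt does not change between calls, the B[2, 1] entry could be set once outside the update function, which would remove one more write per step.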

@ptrblck Same issue here. In my case, index_put_ on CUDA is much slower than on the CPU when computing the confusion matrix.

Here is the code snippet:

    def add_batch(self, x, y):
        x_row = x.reshape(-1)
        y_row = y.reshape(-1)

        idxs = torch.stack([x_row, y_row], dim=0)

        if self.ones is None or self.last_scan_size != idxs.shape[-1]:
            self.ones = torch.ones((idxs.shape[-1]), device=x.device, dtype=torch.long)
            self.last_scan_size = idxs.shape[-1]

        self.conf_matrix = self.conf_matrix.index_put_(tuple(idxs), self.ones, accumulate=True)

With the CPU version it takes around 5 minutes [02:14<03:05, 18.62it/s], while on CUDA it becomes about 12x slower [00:16<1:05:09, 1.53it/s].

Since transferring the tensor from CUDA to the CPU incurs extra cost, any ideas for solving this problem would be very helpful for my project.
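
One alternative that may be worth trying, though nobody in this thread has confirmed it helps, is to skip the accumulating index_put_ altogether: flatten each (row, column) pair into a single index and count occurrences with torch.bincount. The sketch below assumes conf_matrix is a 2-D long tensor and that x and y hold non-negative class ids smaller than its dimensions. The function name add_batch_bincount is hypothetical.

```python
import torch


def add_batch_bincount(conf_matrix: torch.Tensor, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Accumulate (x, y) pairs into conf_matrix using bincount.

    Assumes conf_matrix is a 2-D long tensor and x, y are long tensors of
    class ids on the same device, with 0 <= x < rows and 0 <= y < cols.
    """
    n_cols = conf_matrix.shape[1]
    # Map each (row, col) pair to its position in the flattened matrix.
    flat = x.reshape(-1) * n_cols + y.reshape(-1)
    # Count how often each flattened position occurs.
    counts = torch.bincount(flat, minlength=conf_matrix.numel())
    conf_matrix += counts.reshape(conf_matrix.shape)
    return conf_matrix


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.tensor([0, 1, 2, 2], device=device)
y = torch.tensor([0, 1, 1, 2], device=device)

conf = torch.zeros(3, 3, dtype=torch.long, device=device)
add_batch_bincount(conf, x, y)

# Reference result computed with the index_put_ approach from the post.
ref = torch.zeros(3, 3, dtype=torch.long, device=device)
ref.index_put_((x, y), torch.ones(4, dtype=torch.long, device=device), accumulate=True)
```

When comparing the two versions on CUDA, call torch.cuda.synchronize() before reading the clock, since kernels are launched asynchronously and the timings can otherwise be misleading.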
