Index_put_ really slow on gpu

I am trying to understand why my code is 3x slower on GPU. I’m profiling and debugging my update state function, which should be straightforward.
Everything should be on the GPU, so I do not understand why placing these values would take so long. However, when looking at the profiling, we can see that the index_put_ is running from, which indicates that most of the function is actually performed on the CPU, even though all the values are on the GPU.

I’m new to using torch in c++, so please help me understand.

    void AAGV::update_state(double dt) {
        auto deltaTime = torch::scalar_tensor(dt, TOptions(torch::kDouble, device));
        torch::Tensor orientation = getOrientation();

        // TODO:: Very slow on gpu, OPTIMIZE THIS
        _B.index_put_({0, 0}, torch::cos(orientation) * deltaTime);
        _B.index_put_({1, 0}, torch::sin(orientation) * deltaTime);
        _B.index_put_({2, 1}, deltaTime);

//        torch::Tensor B = torch::tensor({{torch::cos(orientation) * deltaTime, 0.},
//                                         {torch::sin(orientation) * deltaTime, 0.},
//                                         {0.,                                  deltaTime}},
//                                        TOptions(torch::kDouble, device));

//        double data[] = {torch::cos(orientation).*dt, 0., torch::sin(orientation) * dt, 0., 0., dt};
//        torch::Tensor B = torch::from_blob(data, {3, 2}, torch::kDouble).to(device);

        torch::Tensor vel = torch::stack({getLinearVelocity(), getAngularVelocity()});

        torch::Tensor new_state = A.matmul(getState()) + _B.matmul(vel);

Could you describe how getOrientation is defined and how you’ve verified orientation is on the GPU?

Yes, getOrientation is just a getter indexing into a state tensor.
I only have this function to make my code more readable.

torch::Tensor getOrientation() {
            return state.index({2});
torch::Tensor state = torch::zeros({3, 1}, torch::TensorOptions().dtype(torch::kDouble).device(device));

And I verify it by debuging and printing out the device.:

0.01 *
[ CUDADoubleType{1} ]

Could it be because of my indexing is not on the GPU?
the {0, 0} part of _B.index_put_({0, 0}, torch::cos(orientation) * deltaTime);