Hey,

I am running into a problem I am not able to make sense of. Maybe you can help me out.

I am working on a C++ extension and reimplementing some functions for performance.

In one of them I am trying to iterate through a 1-dimensional tensor and access each element in it.

```
torch::Tensor unpack(const torch::Tensor &self) {
TORCH_CHECK(self.dim() == 1, "Only one dimensional tensor allowed");
TORCH_CHECK(self.is_contiguous(), "Tensor has to be contiguous");
TORCH_CHECK(self.dtype() == torch::kInt64, "Tensor has to be long type");
// Get pointer to data
auto *self_ptr = self.data_ptr<int64_t>();
// Access data
// Crashes with a GPU tensor
std::cout << *self_ptr++ << "\n"; // index 0
std::cout << *self_ptr++ << "\n"; // index 1
...
}
```

Sample input:

```
auto opts = torch::TensorOptions().dtype(torch::kInt64);
auto tensor = torch::ones({4}, opts);
unpack(tensor);
```

This all works fine when the input tensor is on the CPU, BUT if I pass a tensor that is on the GPU I run into a segmentation fault. Note that the tensor is ultimately passed in from the Python program, if that is relevant.

UPDATE:

I tested moving the tensor back to the CPU first, and this works, so there is a workaround…
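For reference, the workaround looks roughly like this (a sketch of the calling side; as I understand it, `data_ptr()` on a CUDA tensor returns a device pointer, which is why dereferencing it on the host crashes):

```cpp
// Hypothetical caller: build the tensor on the GPU, as in the sample input.
auto opts = torch::TensorOptions().dtype(torch::kInt64).device(torch::kCUDA);
auto tensor = torch::ones({4}, opts);

// Workaround: copy to host memory before unpack() touches data_ptr().
// .cpu() allocates and copies; for a tensor already on the CPU it is a no-op.
unpack(tensor.cpu());
```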

Is there a way of accessing the data of the GPU tensor without the overhead of moving it?

Hoping someone can help me out here.

Greetings,

Patrick