Given a tensor that specifies the sizes of block matrices, I am trying to fill the block-diagonal parts with a specific value. For instance, let the block sizes be given by `size_tensor`:

```
size_tensor = torch.tensor([2,1,3], device='cuda')
```

The output I want to get looks like this:

```
output =
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1]
```

As you can see, the block-diagonal part is filled with a specific value (in this case `1`), where the block sizes are defined in `size_tensor`. Note that the blocks are not all the same size. My question is: what would be the most efficient way to perform this operation?

Currently, my code looks as follows:

```
block_components = [torch.full((x,x), 1, device='cuda') for x in size_tensor]
output = torch.block_diag(*block_components)
```

However, this seems a bit slow when I'm working on a GPU. In my case, `size_tensor` is the result of previous GPU operations, so it lives on the GPU. Since iterating over a tensor on the GPU is slow, I changed the code above to:

```
block_components = [torch.full((x,x), 1, device='cuda') for x in size_tensor.to('cpu')]
output = torch.block_diag(*block_components)
```

This runs a bit faster. However, it still performs an operation that copies a GPU tensor (`size_tensor`) to the CPU, so it does not seem optimal either.

So my question is: what would be the most efficient way of creating a block-diagonal matrix in this situation?
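For reference, one loop-free idea (a sketch only, not benchmarked, so I don't know whether it is actually faster) is to build the matrix by comparing block labels instead of concatenating blocks: assign each row/column its block index with `torch.repeat_interleave`, then mark entry `(i, j)` as `1` exactly when row `i` and column `j` share a label. Shown on CPU here for readability; the same code should run unchanged with a CUDA tensor:

```
import torch

size_tensor = torch.tensor([2, 1, 3])  # would be device='cuda' in my setting

# Label each row/column with the index of the block it belongs to:
# repeat_interleave([0, 1, 2], [2, 1, 3]) -> [0, 0, 1, 2, 2, 2]
block_ids = torch.repeat_interleave(
    torch.arange(len(size_tensor), device=size_tensor.device), size_tensor
)

# Entry (i, j) is 1 exactly when row i and column j share a block label
output = (block_ids.unsqueeze(0) == block_ids.unsqueeze(1)).long()
```

This avoids the Python loop over tensor elements entirely, but whether it beats `torch.block_diag` in practice would need measuring.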