Filling the block-diagonal part with blocks of varying size

Given a tensor that specifies the sizes of the block matrices, I am trying to fill the block-diagonal part of a matrix with a specific value. For instance, let the block sizes be given by size_tensor:

size_tensor = torch.tensor([2,1,3], device='cuda')

The output I want to get looks like this:

output = 
[1, 1, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 1, 1]

As you can see, the diagonal blocks, whose sizes are defined by size_tensor, are filled with a specific value (in this case 1). Note that the blocks do not all have the same size. My question is: what would be the most efficient way to perform this operation?

Currently, my code looks as follows:

# Build one dense block per size, then assemble them with torch.block_diag.
block_components = [torch.full((x, x), 1, device='cuda') for x in size_tensor]
output = torch.block_diag(*block_components)

However, this seems a bit slow when running on the GPU. In my case, size_tensor comes from previous operations on the GPU, so it already lives on the GPU device. Since iterating over a GPU tensor in Python is slow (each element has to be transferred to the host before it can be used as a size), I changed the code above to

# Move the sizes to the CPU once, so the Python loop no longer reads GPU elements one by one.
block_components = [torch.full((x, x), 1, device='cuda') for x in size_tensor.to('cpu')]
output = torch.block_diag(*block_components)

The code runs a bit faster. However, it still performs an operation that copies a GPU tensor (size_tensor) to the CPU, so this does not seem like an optimal way to do it.
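
In case it helps to reproduce the issue, here is a sketch of how the two variants can be timed; the workload is a placeholder with many small random blocks, not my actual data, and the torch.cuda.synchronize() calls are there so asynchronous kernel launches do not skew the measurement:

import time
import torch

def time_fn(fn, iters=100):
    fn()                      # warm-up
    torch.cuda.synchronize()  # finish pending work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()  # wait for async launches before stopping the clock
    return (time.perf_counter() - start) / iters

# Placeholder workload: 512 blocks with sizes between 1 and 7.
sizes = torch.randint(1, 8, (512,), device='cuda')

gpu_iter = lambda: torch.block_diag(
    *[torch.full((x, x), 1, device='cuda') for x in sizes])
cpu_iter = lambda: torch.block_diag(
    *[torch.full((x, x), 1, device='cuda') for x in sizes.to('cpu')])

print(time_fn(gpu_iter), time_fn(cpu_iter))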

So my question is: what would be the most efficient way to create such a block-diagonal matrix in this situation?
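
For completeness, one fully GPU-side alternative I can think of is to label every row/column index with the id of the block it belongs to via torch.repeat_interleave and then compare the labels. This is only a sketch and I have not verified that it is actually faster than torch.block_diag:

import torch

size_tensor = torch.tensor([2, 1, 3], device='cuda')

# Block id of every row/column index: [0, 0, 1, 2, 2, 2] for sizes [2, 1, 3].
block_ids = torch.repeat_interleave(
    torch.arange(size_tensor.numel(), device=size_tensor.device),
    size_tensor,
)

# (i, j) lies inside a diagonal block exactly when block_ids[i] == block_ids[j].
output = (block_ids.unsqueeze(0) == block_ids.unsqueeze(1)).long()

This avoids the Python loop and the device-to-host copy entirely, but it materializes the full dense comparison, so I am not sure how well it scales compared to the approaches above.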