Memory-efficient way to create block triangular matrices

Hi guys, I want my weight matrix to be a block triangular matrix (and to stay that way during training). What would be a memory-efficient way to achieve this? Currently, I create a mask using torch.triu (taking its Kronecker product with an all-ones matrix of my block size), then call register_hook with this mask matrix. However, my weight matrix is quite large, so I would like to avoid storing the mask matrix.
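
For concreteness, here is a minimal sketch of the masking setup described above. The helper name block_triu_mask, the upper-triangular orientation, and the tiny sizes are illustrative assumptions, not the actual code; it only shows torch.triu, torch.kron and register_hook working together, and why the mask costs as much memory as the weight itself.

```python
import torch

def block_triu_mask(n_blocks: int, block_size: int) -> torch.Tensor:
    # Hypothetical helper: upper-triangular pattern over blocks,
    # expanded to element level with a Kronecker product.
    block_pattern = torch.triu(torch.ones(n_blocks, n_blocks))
    return torch.kron(block_pattern, torch.ones(block_size, block_size))

n_blocks, block_size = 3, 2  # illustrative sizes only
mask = block_triu_mask(n_blocks, block_size)  # same shape as the weight

weight = torch.nn.Parameter(
    torch.randn(n_blocks * block_size, n_blocks * block_size)
)
with torch.no_grad():
    weight.mul_(mask)  # zero the lower blocks once at initialization

# The hook multiplies every incoming gradient by the mask, so entries
# outside the block triangle never receive a nonzero gradient.
weight.register_hook(lambda grad: grad * mask)
```

Note that zeroed gradients alone are not enough if the optimizer applies weight decay or similar updates that do not go through the gradient hook; that is an additional caveat of this sketch.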

There’s a straightforward way to implement the triangular matrix multiplication using a list, but is there an existing API I can use to avoid going down that path? Thanks.
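
The list-based route mentioned above might look roughly like the sketch below. BlockTriuLinear is a hypothetical module name, and the layout (one parameter per block row, upper-triangular orientation) is an assumption made for illustration. It stores only the blocks on or above the diagonal, so no mask and no zero blocks are kept in memory.

```python
import torch
from torch import nn

class BlockTriuLinear(nn.Module):
    """Hypothetical layer computing y = x @ W.T for a block upper-triangular W."""

    def __init__(self, n_blocks: int, block_size: int):
        super().__init__()
        self.n_blocks = n_blocks
        self.block_size = block_size
        # Block row i stores blocks (i, i) ... (i, n_blocks - 1) side by side.
        self.rows = nn.ParameterList([
            nn.Parameter(0.01 * torch.randn(block_size, (n_blocks - i) * block_size))
            for i in range(n_blocks)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bs = self.block_size
        # Output block i only depends on input blocks i, i+1, ..., n_blocks - 1.
        outs = [x[..., i * bs:] @ row.T for i, row in enumerate(self.rows)]
        return torch.cat(outs, dim=-1)

    def dense_weight(self) -> torch.Tensor:
        # Materializes the full matrix; intended for checking only.
        bs = self.block_size
        n = self.n_blocks * bs
        W = torch.zeros(n, n)
        for i, row in enumerate(self.rows):
            W[i * bs:(i + 1) * bs, i * bs:] = row.detach()
        return W

layer = BlockTriuLinear(n_blocks=3, block_size=2)  # illustrative sizes only
x = torch.randn(4, 6)
y = layer(x)
```

Because the zero blocks are never stored, they cannot become nonzero during training, and no gradient hook is needed. The trade-off is one smaller matmul per block row instead of a single large one.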