where rest_idx is a sorted array that can contain any indices between 0 and 10474. The code I wrote above does not work because the slices are no longer leaf tensors. Splitting test_tensor into two parts first and concatenating them afterwards is not applicable either, I guess, because the selected indices can be arbitrary.

Is there any efficient way to solve this problem? Thanks a lot in advance!

Could you explain further? I think the approach you suggested only sets the selected elements of test_tensor to zero. What I want is for the remaining parts not to be optimized.

Given test_leaf = [1,2,3] and mask = [1,1,0], that code does
test_tensor = [1,2,0] + [0,0,3] = [1,2,3]

Because the second summand was detached, test_leaf's gradient will be zero at positions where mask is zero.
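To make that concrete, here is a minimal sketch of the mask-and-detach idea using the numbers from the example above. The names frozen_idx and loss are hypothetical and only for illustration, and it assumes the index array lists the positions that should not be optimized.

```python
import torch

# Values from the example above; test_leaf is the parameter being optimized.
test_leaf = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Hypothetical index array of positions that should NOT be optimized.
frozen_idx = torch.tensor([2])

# Build the mask: 1 = trainable, 0 = frozen. ones_like does not track gradients.
mask = torch.ones_like(test_leaf)
mask[frozen_idx] = 0  # mask is now [1, 1, 0]

# First summand carries gradient; the second is detached, so autograd
# treats it as a constant.
test_tensor = test_leaf * mask + (test_leaf * (1 - mask)).detach()

# Hypothetical loss, just to produce a gradient.
loss = (test_tensor ** 2).sum()
loss.backward()

print(test_tensor)     # same values as test_leaf: [1, 2, 3]
print(test_leaf.grad)  # [2, 4, 0]: zero at the frozen position
```

The forward values are unchanged, so the rest of the model sees the full tensor, while the gradient reaching test_leaf is zero wherever mask is zero.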

Note that the usual adaptive optimizers like Adam won't calculate gradient moments correctly for such parameters. There is optim.SparseAdam, but I'm not sure how to use it.