How to reduce TPU RAM usage when operating on large tensors

I am currently working with the task vector of Llama's parameters, which is huge (over 13,015,864,320 parameters), and I hold it as a tensor of shape (3, 13015864320) in float16. Every time I try to take the kthvalue or do an element-wise multiplication on this tensor, RAM usage goes unreasonably high (over 300 GB). Is there any way to read this tensor in batches and do the calculations sequentially?
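Roughly what I am doing looks like the sketch below (a minimal, reduced-size version so it runs; `task_vectors` and `k` are placeholders for my real data and threshold):

```python
import torch

# In my real case N = 13_015_864_320; a much smaller N is used here so the sketch runs.
N = 1_000_000
task_vectors = torch.randn(3, N).to(torch.float16)  # stacked task vectors in float16

k = N // 2  # placeholder; in my real code k comes from elsewhere
# These two operations are where RAM usage explodes at full scale:
kth_values = torch.kthvalue(task_vectors, k, dim=1).values  # k-th smallest value per row
product = task_vectors * task_vectors                        # element-wise multiplication
```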
Moreover, I found that NumPy doing the exact same computation (np.partition in place of kthvalue, and np.multiply instead of the * operator) uses much less memory. Can someone explain why this happens?
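For comparison, the NumPy version that stays much lower in memory looks roughly like this (again a reduced-size sketch; np.partition at index k - 1 puts the k-th smallest value of each row in that position, which is how I replace kthvalue):

```python
import numpy as np

# Same reduced size as above; the real array is (3, 13_015_864_320) in float16.
N = 1_000_000
task_vectors_np = np.random.randn(3, N).astype(np.float16)

k = N // 2  # placeholder threshold, as above
# np.partition(a, k - 1, axis=1) places the k-th smallest value of each row at
# column index k - 1, matching torch.kthvalue(..., k, dim=1).values:
kth_values = np.partition(task_vectors_np, k - 1, axis=1)[:, k - 1]
# np.multiply instead of the * operator for the element-wise product:
product = np.multiply(task_vectors_np, task_vectors_np)
```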