Could you try your code again with torch.from_numpy instead of directly wrapping the numpy array in a torch.Tensor?
Wrapping the array directly like that isn’t recommended.
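A minimal sketch of the difference, assuming the data is a float64 numpy array (the name `arr` is just a placeholder, not from the original snippet):

```python
import numpy as np
import torch

arr = np.random.rand(4, 3)  # example float64 numpy array (placeholder)

# Not recommended: torch.Tensor() creates a new float32 copy,
# silently casting the dtype and duplicating the data.
t_copy = torch.Tensor(arr)

# Recommended: torch.from_numpy() shares memory with the numpy array
# and keeps its dtype.
t_shared = torch.from_numpy(arr)

print(t_copy.dtype)    # torch.float32
print(t_shared.dtype)  # torch.float64
```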
I am also facing the same issue and can reproduce it with the provided code snippet. Could we get some help, please?
UPDATE: Running torch.cuda.empty_cache() after the operation reduces the memory usage for batch_size 1 to a reasonable amount. As the OP said, there is no issue running this snippet with a batch size > 1.
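For reference, a sketch of where the call goes; the model and input here are hypothetical stand-ins for the original snippet, not the OP's actual code:

```python
import torch

# Hypothetical stand-ins for the model and input from the original snippet.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(1, 1024, device="cuda")  # batch_size 1

with torch.no_grad():
    out = model(x)

# Drop references first, then release cached blocks back to the driver so the
# freed memory is visible outside PyTorch; PyTorch would reuse the cache anyway.
del out
torch.cuda.empty_cache()

print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```

Note that empty_cache() mainly changes what nvidia-smi reports; it does not reduce the memory actually allocated by live tensors.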