Hello,
I have a PyTorch model (compiled into a .pt file) that computes torch.prod over a tensor of shape (1000, 400, 400, 144).
It takes 10 seconds on a high-end NVIDIA GPU (for example, an A100).
I am trying to find a way to make it run faster.
So far, the only effective optimization has been switching to BFloat16.
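To give a sense of scale, here is a quick back-of-the-envelope estimate of the input size. It assumes a dense tensor and the standard element sizes (4 bytes for float32, 2 bytes for bfloat16); `tensor_bytes` is just an illustrative helper, not part of my model.

```python
import math

# Shape of the input tensor from the post.
SHAPE = (1000, 400, 400, 144)

# Standard element sizes in bytes (assumption: dense, contiguous storage).
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}


def tensor_bytes(shape, dtype_name):
    """Return the raw storage size in bytes for a dense tensor (illustrative helper)."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype_name]


for name in BYTES_PER_ELEMENT:
    print(f"{name}: {tensor_bytes(SHAPE, name) / 1e9:.2f} GB")
# float32: 92.16 GB
# bfloat16: 46.08 GB
```

If this estimate is right, halving the element size halves the amount of data that has to be read, which might be why BFloat16 was the only change that helped.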
Do you have any suggestions for other optimizations?
Thanks!