How does torch.compile decide when to call cudaMalloc
inside a torch.compile function? I have read the PT2 paper and searched for this issue on GitHub, but I am unable to understand the procedure. I am still going through the ASPLOS 2024 resources and the code to work it out. Any other helpful resources are welcome.