Where are the gradients from the backward graph copied to param.grad with CUDA graphs?

Hi all

I am currently working with torch.compile CUDA graphs. For a simple MLP I see three graphs being compiled separately (forward, backward, and loss), with data shared across them. One observation is that the backward pass's output tensors, i.e. the gradients, live at a different memory location than what the optimizer receives; also, the optimizer's updated weights are explicitly copied back to the parameter tensors in runtime_wrapper.py.
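For reference, here is a minimal sketch (assumed, not my exact script) of the kind of setup I am describing: a small MLP run through torch.compile in "reduce-overhead" (CUDA graphs) mode, printing the data pointers of the gradients the optimizer ends up seeing. The layer sizes and the `step` helper are just illustrative.

```python
import torch
import torch.nn.functional as F

# Assumed minimal repro: small MLP trained under torch.compile CUDA graphs.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def step(x, y):
    # Loss is computed inside the compiled region, so forward, backward and
    # loss can each end up being compiled separately.
    return F.cross_entropy(model(x), y)

compiled_step = torch.compile(step, mode="reduce-overhead")

x = torch.randn(8, 16, device="cuda")
y = torch.randint(0, 4, (8,), device="cuda")

for i in range(3):
    opt.zero_grad(set_to_none=True)
    loss = compiled_step(x, y)
    loss.backward()
    # Compare these pointers against the backward graph's output buffers:
    # with CUDA graphs the backward writes into static buffers, and the
    # gradients the optimizer reads here may live at a different address.
    for name, p in model.named_parameters():
        print(i, name, hex(p.grad.data_ptr()))
    opt.step()
```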

So I want to know where exactly the backward graph's outputs are copied to the parameters' .grad tensors, and why this is done.

Based on this, I will implement something similar for our own accelerator.

@albanD @marksaroufim