If I use torch.compile to compile the whole graph with my own compiler, how should I manage memory in my compiler?

If I use torch.compile to compile the whole graph with my own backend compiler, I have some questions about the forward stage (a minimal sketch of the setup I mean is after the list):

1. If I enable memory reuse in the forward pass, how does the backward pass get the activations to compute the gradients? Is there an example of this in PyTorch?
2. If I disable memory reuse but enable some op fusion (fusing op A and op B into a single op), A's output stays in SRAM / local memory / global memory, so torch cannot access that activation. How are the gradients computed in the backward pass? Is there an example of this in PyTorch?
3. How should I manage memory in my own compiler so that torch.compile actually speeds up training?
4. How does the backward pass (autograd) get the activations from my own compiler? Must the output of every op in the graph stay in DDR?
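
For context, here is a rough sketch of the kind of backend registration I mean (names like `my_fw_compiler`, `my_bw_compiler`, and `fn` are just placeholders for my own compiler). It follows the pattern from the PyTorch custom-backends documentation, where wrapping the backend with `aot_autograd` gives me separate forward and backward FX graphs, and the forward graph's extra outputs are the saved activations that the backward graph consumes:

```python
# Sketch of a custom torch.compile backend for training, following the
# documented aot_autograd pattern. Placeholder compilers just print the
# graphs and fall back to running the FX graph as-is.
import torch
from torch._dynamo.backends.common import aot_autograd
from functorch.compile import make_boxed_func


def my_fw_compiler(gm: torch.fx.GraphModule, example_inputs):
    # Placeholder for my compiler lowering the forward graph. The extra
    # outputs of this graph are the activations saved for backward.
    print("forward graph:")
    gm.graph.print_tabular()
    return make_boxed_func(gm.forward)


def my_bw_compiler(gm: torch.fx.GraphModule, example_inputs):
    # Placeholder for my compiler lowering the backward graph. Its inputs
    # are the saved activations plus the incoming gradients.
    print("backward graph:")
    gm.graph.print_tabular()
    return make_boxed_func(gm.forward)


my_backend = aot_autograd(fw_compiler=my_fw_compiler, bw_compiler=my_bw_compiler)


@torch.compile(backend=my_backend)
def fn(x, w):
    return torch.relu(x @ w).sum()


x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 3, requires_grad=True)
fn(x, w).backward()
```

My questions above are essentially about the tensors that cross this forward/backward boundary: how much freedom does my compiler have over where (and for how long) their memory lives?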