How to store backprop gradients properly to avoid CUDA out-of-memory errors

Hi, I am having an issue with backprop gradients eating up a lot of GPU memory.

Note: I tried decreasing BATCH_SIZE, but it does not help.
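For context, when lowering the batch size does not help, one standard memory-saving technique (independent of DeepSpeed) is activation checkpointing, which recomputes intermediate activations during backward instead of storing them. A minimal sketch with hypothetical layers (the real model is the GNN described below):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical stack of layers standing in for the real model.
layers = torch.nn.ModuleList(
    [torch.nn.Linear(64, 64) for _ in range(8)]
)

x = torch.randn(16, 64, requires_grad=True)
h = x
for layer in layers:
    # Activations inside each layer are recomputed in backward
    # instead of being stored, trading compute for memory.
    h = checkpoint(layer, h, use_reentrant=False)

loss = h.sum()
loss.backward()
print(x.grad.shape)  # torch.Size([16, 64])
```

This trades extra forward compute for a lower peak activation footprint, which is often the actual cause of OOM during backprop.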

Someone suggested using DeepSpeed ZeRO-Offload.

However, my code has a GNN-like structure: NN_output = graph.forward(NN_input, types="f")

So the outputs = model_engine(inputs) pattern does not seem to fit my case? My arguments do not follow that calling style either.
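To make the mismatch concrete: as far as I understand, the DeepSpeed engine forwards its arguments to the wrapped module, so one option may be a thin adapter module that fixes the extra types="f" keyword. A sketch of this idea, where Graph is a hypothetical stand-in for the real model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real GNN-style model; the actual
# `graph` object and its `types` keyword come from the question above.
class Graph(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 4)

    def forward(self, x, types="f"):
        # In the real model, `types` would select some mode of the forward pass.
        return self.lin(x)

class GraphWrapper(nn.Module):
    """Adapter so an engine that calls model(inputs) still reaches
    graph.forward(inputs, types="f") with the keyword fixed."""
    def __init__(self, graph, types="f"):
        super().__init__()
        self.graph = graph
        self.types = types

    def forward(self, x):
        return self.graph(x, types=self.types)

graph = Graph()
model = GraphWrapper(graph, types="f")
# `model` could then be passed to deepspeed.initialize(...); the returned
# model_engine should delegate model_engine(NN_input) to this forward().
out = model(torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 4])
```

Would something like this be the right approach, or does ZeRO-Offload need the forward signature to look a specific way?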

Any ideas?