How to store backprop gradients properly to avoid CUDA out-of-memory errors

Hi, I am having an issue with backprop gradients eating up a lot of GPU memory.

Note: I tried decreasing BATCH_SIZE, but it does not help.
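For context, when lowering the batch size does not help, one standard memory-saving technique (independent of DeepSpeed) is activation checkpointing, which recomputes intermediate activations during backward instead of storing them. A minimal sketch with hypothetical layers (the real model is the GNN described below):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical stack of layers standing in for the real model.
layers = torch.nn.ModuleList(
    [torch.nn.Linear(64, 64) for _ in range(8)]
)

x = torch.randn(16, 64, requires_grad=True)
h = x
for layer in layers:
    # Activations inside each layer are recomputed in backward
    # instead of being stored, trading compute for memory.
    h = checkpoint(layer, h, use_reentrant=False)

loss = h.sum()
loss.backward()
print(x.grad.shape)  # torch.Size([16, 64])
```

This trades extra forward compute for a lower peak activation footprint, which is often the actual cause of OOM during backprop.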

Someone suggested using DeepSpeed ZeRO-Offload.

However, my code has a GNN-like structure: NN_output = graph.forward(NN_input, types="f")

So the outputs = model_engine(inputs) pattern does not seem to fit my case? My arguments do not follow that calling style either.
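To make the mismatch concrete: as far as I understand, the DeepSpeed engine forwards its arguments to the wrapped module, so one option may be a thin adapter module that fixes the extra types="f" keyword. A sketch of this idea, where Graph is a hypothetical stand-in for the real model:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real GNN-style model; the actual
# `graph` object and its `types` keyword come from the question above.
class Graph(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 4)

    def forward(self, x, types="f"):
        # In the real model, `types` would select some mode of the forward pass.
        return self.lin(x)

class GraphWrapper(nn.Module):
    """Adapter so an engine that calls model(inputs) still reaches
    graph.forward(inputs, types="f") with the keyword fixed."""
    def __init__(self, graph, types="f"):
        super().__init__()
        self.graph = graph
        self.types = types

    def forward(self, x):
        return self.graph(x, types=self.types)

graph = Graph()
model = GraphWrapper(graph, types="f")
# `model` could then be passed to deepspeed.initialize(...); the returned
# model_engine should delegate model_engine(NN_input) to this forward().
out = model(torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 4])
```

Would something like this be the right approach, or does ZeRO-Offload need the forward signature to look a specific way?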

Any ideas?