torch.cuda.memory._dump_snapshot() not producing a snapshot on the second iteration of model.py

I’m working on a research project built on Andrej Karpathy’s nanoGPT. I’ve run into a GPU memory problem and I’m trying to use torch.cuda.memory._dump_snapshot() to track down the error. The model defined in “model.py” is called in a loop in “train.py” as “train.py” loads the data. The problem is that while torch.cuda.memory._dump_snapshot() produces a snapshot on the first iteration, it doesn’t seem to do anything on the iteration that causes the crash, so I can’t see the problem.

I’ve also tried enabling torch.cuda.memory._record_memory_history(max_entries=100000) only after a certain number of iterations, but that seems to have no effect.
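
For reference, the pattern I tried looks roughly like this minimal sketch (the dummy model, iteration threshold, and output filename here are placeholders for illustration, not my actual training code):

```python
import torch

device = "cuda"
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters())

RECORD_FROM_ITER = 5  # placeholder: start recording shortly before the failing iteration

for iter_num in range(10):
    if iter_num == RECORD_FROM_ITER:
        # start recording allocator events (allocations/frees with stack traces)
        torch.cuda.memory._record_memory_history(max_entries=100000)

    x = torch.randn(64, 1024, device=device)
    try:
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    except torch.cuda.OutOfMemoryError:
        # write out whatever history was recorded before the OOM;
        # the .pickle can be inspected at https://pytorch.org/memory_viz
        torch.cuda.memory._dump_snapshot("oom_snapshot.pickle")
        raise
```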

Any help would be greatly appreciated.
Thank you,
Leigh

My solution was to skip the estimate_loss() call and instead call torch.cuda.memory._record_memory_history(max_entries=100000) before every micro_step and torch.cuda.memory._record_memory_history(enabled=None) after every micro_step. This let me see what was going on inside my GPU. However, I’m still not sure I could produce the memory dumps after further calls to model().
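
For anyone who hits the same thing, the loop ended up looking roughly like the sketch below. The names (micro_step, gradient_accumulation_steps, ctx, scaler, get_batch, iter_num) follow nanoGPT’s train.py, the snapshot filename is a placeholder, and the _dump_snapshot call before disabling is my assumption about where the per-step dump has to happen so the recorded history isn’t lost:

```python
for micro_step in range(gradient_accumulation_steps):
    # record allocator history for just this micro-step
    torch.cuda.memory._record_memory_history(max_entries=100000)

    with ctx:
        logits, loss = model(X, Y)
        loss = loss / gradient_accumulation_steps
    X, Y = get_batch('train')
    scaler.scale(loss).backward()

    # dump this micro-step's history, then stop recording
    torch.cuda.memory._dump_snapshot(f"snapshot_{iter_num}_{micro_step}.pickle")
    torch.cuda.memory._record_memory_history(enabled=None)
```

Each dump gets a unique filename so later micro_steps don’t overwrite earlier ones.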