So I was playing around trying to learn gradient checkpointing, and I found some interesting behavior that does not match my understanding of the paper: there is a sweet spot for the number of checkpoints, and going beyond it makes memory usage increase. I see the exact same behavior with `checkpoint_sequential` and `checkpoint`. Here’s a link to my code with `checkpoint_sequential` (go back a dir for the non-sequential version). I am working on a 2080 Super, and I made an overly simple linear model (linear layers, ReLUs, and a sigmoid at the end) sized so that it just barely doesn’t fit into memory (if you want code in here, I can replicate this with a network that will look nicer). I get the following results:
| Num Splits | SMI Memory |
| --- | --- |
| 2 | 7212MB |
| 4 | 6000MB |
| 8 | 5428MB |
| 10 | 5810MB |
| 16 | 6190MB |
The non-sequential version has similar results (one splitting actually uses 7908MB!). `torch.cuda.max_memory_allocated()` shows the same trend (with smaller numbers), but `torch.cuda.memory_allocated()` does not.
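For anyone who wants to reproduce the measurement without following the link, here is a minimal sketch of the kind of sweep described above. It is not the linked code: `build_model`, `peak_memory_mb`, and the layer sizes are hypothetical stand-ins, and it assumes a recent PyTorch release where `checkpoint_sequential` accepts `use_reentrant`. It reports `torch.cuda.max_memory_allocated()`, which counts tensor allocations only and so reads lower than `nvidia-smi`.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential


def build_model(depth=16, width=256):
    # Hypothetical stand-in for the model in the post:
    # Linear + ReLU blocks, ending with a Linear + Sigmoid head.
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, 1), nn.Sigmoid()]
    return nn.Sequential(*layers)


def peak_memory_mb(model, segments, batch=64, width=256, device="cpu"):
    """Run one forward/backward pass with `segments` checkpointed chunks.

    Returns the peak allocated CUDA memory in MiB, or None on CPU,
    where torch.cuda statistics are not available.
    """
    model = model.to(device)
    # Drop gradients from earlier runs so they do not inflate later readings.
    model.zero_grad(set_to_none=True)
    x = torch.randn(batch, width, device=device)
    if device == "cuda":
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
    out = checkpoint_sequential(model, segments, x, use_reentrant=False)
    out.sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
        return torch.cuda.max_memory_allocated() / 2**20
    return None


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = build_model()
    for segments in (2, 4, 8, 10, 16):
        print(segments, peak_memory_mb(model, segments, device=device))
```

The peak statistics are reset before every run so that each reading reflects only that split count rather than the high-water mark of an earlier one.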
My understanding is that the more checkpoints I use, the lower my memory usage should be. Instead, my results show something very different and remind me of how there is an optimal number of CPUs to use in parallel processing (beyond which the overhead starts to eat too many resources). So:
- Do I have a bug somewhere?
- Am I misunderstanding the paper?
- If so, how do I find the optimal number of checkpoints?
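As a rough sanity check on the expectation that more checkpoints always means less memory, here is a minimal sketch of a simplified cost model. It is an assumption for illustration, not necessarily the accounting used in the paper or inside `checkpoint_sequential`: every layer's activation is treated as one equal-sized unit, the stored checkpoints cost one unit per segment, and the backward pass additionally holds the activations of one recomputed segment. The names `peak_activation_units` and `best_segments` are hypothetical.

```python
import math


def peak_activation_units(num_layers, num_segments):
    # Assumed cost model: one stored checkpoint per segment, plus the
    # activations of the longest segment while it is being recomputed.
    segment_len = math.ceil(num_layers / num_segments)
    return num_segments + segment_len


def best_segments(num_layers):
    # Brute-force search for the segment count with the lowest modeled peak.
    return min(
        range(1, num_layers + 1),
        key=lambda k: peak_activation_units(num_layers, k),
    )


if __name__ == "__main__":
    for k in (2, 4, 8, 10, 16):
        print(k, peak_activation_units(64, k))
    print("best:", best_segments(64))
```

Under these assumptions the modeled peak falls and then rises again as the segment count grows, with the minimum near the square root of the layer count, so a sweet spot is at least not surprising on its own. Whether this explains the numbers in the table is a separate question.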