I’m training a 14B model using FSDP. Monitoring torch.cuda.max_memory_allocated(), I found that the reported peak exceeds my GPU’s total memory by around 60 GB.
I’m wondering why this happens. Does PyTorch use some sort of virtual memory management? If so, is there any documentation?
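For context, this is roughly how I’m reading the peak (a minimal sketch; the device index and the training step are placeholders for my actual FSDP setup):

```python
import torch

device = torch.device("cuda:0")  # placeholder; one rank of my FSDP job

# ... FSDP forward/backward/optimizer step runs here ...

peak = torch.cuda.max_memory_allocated(device)                 # peak bytes PyTorch ever had allocated
total = torch.cuda.get_device_properties(device).total_memory  # physical memory on the device

print(f"peak allocated: {peak / 2**30:.1f} GiB")
print(f"device total:   {total / 2**30:.1f} GiB")
print(f"excess:         {(peak - total) / 2**30:.1f} GiB")     # comes out around 60 GB for me
```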
It does not do that. The only case where I’ve seen this happen is on Windows machines, where the OS is automagically providing a “unified memory”-like feature that we cannot control in any way.
Thanks for the reply! I’m currently using a Linux machine (Ubuntu 24.04) with CUDA 12.8. As far as I know, unified memory is not in use on my system.
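In case it helps, here is a quick sanity check of my environment (a minimal sketch; PYTORCH_CUDA_ALLOC_CONF is the standard allocator-config environment variable, which I have not set):

```python
import os
import platform
import torch

print(platform.platform())                    # confirms Linux, not Windows
print(torch.__version__, torch.version.cuda)  # PyTorch build and the CUDA version it was built against

# Allocator behavior (e.g. expandable_segments or the cudaMallocAsync backend)
# is controlled by this env var; unset means the default caching allocator.
print(os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))
```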