I’m training a 14B model using FSDP. Monitoring torch.cuda.max_memory_allocated(), I found that the reported peak exceeds my GPU’s total memory by around 60 GB.
I’m wondering why this happens. Does PyTorch use some sort of virtual memory management? If so, is there any documentation?
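For context, this is roughly how I’m reading the peak (a minimal sketch; the device index and the training step are placeholders for my actual FSDP setup):

```python
import torch

device = torch.device("cuda:0")  # placeholder; one rank of my FSDP job

# ... FSDP forward/backward/optimizer step runs here ...

peak = torch.cuda.max_memory_allocated(device)                 # peak bytes PyTorch ever had allocated
total = torch.cuda.get_device_properties(device).total_memory  # physical memory on the device

print(f"peak allocated: {peak / 2**30:.1f} GiB")
print(f"device total:   {total / 2**30:.1f} GiB")
print(f"excess:         {(peak - total) / 2**30:.1f} GiB")     # comes out around 60 GB for me
```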
It does not do that. The only case where I’ve seen this happen is on Windows machines, where the OS is automagically providing a “unified memory”-like feature that we cannot control in any way.
Thanks for the reply! I’m currently using a Linux machine (Ubuntu 24.04) with CUDA 12.8. As far as I know, unified memory is not in use on my system.
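In case it helps, here is a quick sanity check of my environment (a minimal sketch; PYTORCH_CUDA_ALLOC_CONF is the standard allocator-config environment variable, which I have not set):

```python
import os
import platform
import torch

print(platform.platform())                    # confirms Linux, not Windows
print(torch.__version__, torch.version.cuda)  # PyTorch build and the CUDA version it was built against

# Allocator behavior (e.g. expandable_segments or the cudaMallocAsync backend)
# is controlled by this env var; unset means the default caching allocator.
print(os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))
```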