I have made some progress on this issue by relying on `torch.cuda.memory_summary` rather than `mytensor.device`. It turns out that using `map_location` with options that specify the GPU reserves GPU memory but still reports the model's device as "cpu". In particular, it seems that memory is allocated on the GPU and then immediately freed. However, when you use the default `map_location`, no GPU memory is reserved at all; it is never allocated and thus never freed (at least until you call `mytensor.to()`).
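Something like the following is enough to see the allocator's side of this (a minimal sketch; the checkpoint path and device index are placeholders, not from my actual setup):

```python
import torch

# Load a checkpoint straight onto the GPU via map_location.
# "model.pt" is a placeholder path.
state = torch.load("model.pt", map_location="cuda:0")

# The caching allocator's view of GPU memory:
print(torch.cuda.memory_allocated(0))  # bytes in live tensors right now
print(torch.cuda.memory_reserved(0))   # bytes held by the caching allocator
print(torch.cuda.memory_summary(0))    # detailed breakdown, including freed blocks
```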
By tracking process CPU memory with `psutil`, I found that using `map_location` to the GPU causes the total CPU process memory to spike immediately after calling `torch.load`. However, if you use the `map_location` default, the total CPU memory does not increase until after you call `mytensor.to()`. In other words, both approaches (with or without `map_location`, followed by `.to()`) end up using about the same amount of process CPU memory, but with `map_location` to the GPU that peak is reached right after `torch.load`, whereas with `map_location` to the CPU it is not reached until after `.to()`. A rough sketch of the measurement is below.
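This is roughly the kind of measurement I mean (a sketch only; the file name and the `"some_tensor"` key are hypothetical stand-ins for whatever is in your checkpoint):

```python
import os
import psutil
import torch

proc = psutil.Process(os.getpid())

def rss_mb():
    # Resident set size of this process, in MB.
    return proc.memory_info().rss / 1024 ** 2

print(f"before load: {rss_mb():.0f} MB")

# Default map_location (CPU): the RSS increase shows up here...
state = torch.load("model.pt")
print(f"after load:  {rss_mb():.0f} MB")

# ...and GPU memory is only touched once .to() is called.
tensor = state["some_tensor"].to("cuda:0")
print(f"after .to(): {rss_mb():.0f} MB, "
      f"GPU allocated: {torch.cuda.memory_allocated(0) / 1024 ** 2:.0f} MB")
```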
I’m left wondering whether this was intended behavior.