`torch.load` does not map to GPU as advertised

I have made some progress on this issue by looking at `torch.cuda.memory_summary` rather than `mytensor.device`. It turns out that calling `torch.load` with a `map_location` that specifies the GPU does reserve GPU memory, but the loaded object still reports its device as "cpu". In particular, it looks like memory is allocated on the GPU and then immediately freed. With the default `map_location`, by contrast, GPU memory is never allocated (and thus never freed) until you call `mytensor.to()`.
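Concretely, the check looks something like this (a minimal sketch; `model.pt` is a placeholder for a checkpoint containing a single tensor saved with `torch.save`):

```python
import torch

CKPT = "model.pt"  # placeholder checkpoint path

torch.cuda.reset_peak_memory_stats()

t = torch.load(CKPT, map_location="cuda:0")

print(t.device)                           # in my runs this prints "cpu"
print(torch.cuda.memory_allocated())      # current allocation: back to 0
print(torch.cuda.max_memory_allocated())  # peak: nonzero, so the load did touch the GPU
print(torch.cuda.memory_summary(abbreviated=True))
```

The peak counter is what gives it away: the load touches the GPU even though the tensor claims to live on the CPU.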

By tracking process CPU memory with `psutil`, I found that mapping to the GPU via `map_location` makes the process's CPU memory spike immediately after `torch.load`. With the default `map_location`, CPU memory does not increase until after you call `mytensor.to()`. Both approaches end up at roughly the same CPU memory footprint; the difference is when they get there: mapping to the GPU reaches it right after `torch.load`, while the CPU default doesn't reach it until after `.to()`.
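The measurement itself is nothing fancy; a sketch like the one below (same placeholder checkpoint) is roughly what I used, and re-running it with `map_location="cuda:0"` and without the `.to()` call gives the other data point:

```python
import psutil
import torch

CKPT = "model.pt"  # placeholder checkpoint path
PROC = psutil.Process()

def rss_mib():
    """Resident set size of this process, in MiB."""
    return PROC.memory_info().rss / 2**20

print(f"baseline:         {rss_mib():8.1f} MiB")

t = torch.load(CKPT)  # default map_location (CPU)
print(f"after torch.load: {rss_mib():8.1f} MiB")  # in my runs: barely moves

t = t.to("cuda:0")
print(f"after .to():      {rss_mib():8.1f} MiB")  # in my runs: the spike lands here
```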

I’m left wondering whether this is the intended behavior.