Multiple GPUs: significant memory consumption on the first GPU

There is a known issue where, if you load a state_dict from a checkpoint saved on GPU, the tensors are mapped back onto the GPU. Once loaded, that memory is never freed.

This means that when you do `model.load_state_dict(torch.load(path_to_weights))`, those weights occupy GPU memory that is never freed.
Try it this way instead:

```python
state_dict = torch.load(directory, map_location=lambda storage, loc: storage)
model.load_state_dict(state_dict)
```

This forces the state_dict to be loaded into CPU RAM instead of directly onto the GPU.
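
For context, here is a minimal, self-contained sketch of the full pattern; `map_location="cpu"` is an equivalent shorthand for the lambda above, and the model and checkpoint path are just placeholders:

```python
import torch
import torch.nn as nn

# Placeholder model and checkpoint path, purely for illustration.
model = nn.Linear(10, 2)
torch.save(model.state_dict(), "weights.pth")

# Load the checkpoint into CPU memory first; map_location="cpu" is
# equivalent to the lambda storage, loc: storage form above.
state_dict = torch.load("weights.pth", map_location="cpu")
model.load_state_dict(state_dict)

# Only now move the parameters onto the GPU you actually want to use.
if torch.cuda.is_available():
    model = model.cuda()
```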

Regarding the issue you mentioned, it's not very typical.
In general, DataParallel duplicates everything on each GPU, so there shouldn't be noticeably higher memory consumption on any single one.

Extra consumption usually comes from one of these 3 issues:

- input/output tensors are gathered on the main GPU, which causes a slightly higher memory footprint there (see the sketch below)
- optimizer memory requirements
- the state_dict issue described above
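
As a rough sketch of the first point: with `nn.DataParallel` the input batch is scattered across the GPUs for the forward pass, but the outputs are gathered back on the primary device (`cuda:0` by default), which is where the slight extra memory shows up. The layer and batch sizes below are arbitrary:

```python
import torch
import torch.nn as nn

# Arbitrary toy model, just for illustration.
model = nn.Linear(512, 512)

if torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU during the forward pass.
    model = nn.DataParallel(model)

model = model.cuda()  # parameters live on cuda:0, the default primary device

x = torch.randn(64, 512).cuda()  # input batch starts on cuda:0
out = model(x)                   # scattered across GPUs for the forward pass...
print(out.device)                # ...but the outputs are gathered back on cuda:0
```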
