Ensemble model training memory footprint

I have 5 identical, relatively small MLP models that I want to train in parallel on a single GPU. Each has its own dataset, so there is no overlap in data or model parameters.

I load each model and its dataset in a for loop, store them in an object, and append that object to a list. During optimization, I iterate over these objects and run a forward pass, a backward pass, and an optimizer step for each model.

In nvidia-smi, I see only the memory footprint of a single model. Is this a bug where I’m training only one model, or a PyTorch optimization of some sort?

I stepped through the optimization loop with pdb: all the model object hashes appear to be different, and so are the datasets. What else can I check to confirm I’m optimizing the right model with the right dataset?

# initialize data
dataset_array = []
for i in datasets:
    ds_obj = {}
    ds = Dataset(**kwargs)
    ds_obj['dataset'] = ds
    dataset_array.append(ds_obj)

# initialize models
model_array = []
for i in models:
    model_obj = {}
    model = MyModel(**kwargs)
    # each optimizer only sees the parameters of its own model
    optimizer = Adam(model.parameters(), **kwargs)
    model_obj['model'] = model
    model_obj['optimizer'] = optimizer
    model_array.append(model_obj)

# optimize
for i in range(max_iterations):
    for idx, model_obj in enumerate(model_array):
        dataset = dataset_array[idx]['dataset']
        model = model_obj['model']
        optim = model_obj['optimizer']
        model_input, GT = dataset[i]
        model_output = model(model_input)
        loss = loss_fn(model_output, GT)
        # backward pass and parameter update for this model only
        optim.zero_grad()
        loss.backward()
        optim.step()

Your code looks fine, and I do see memory usage increase with each additional model in this example:

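import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # 32768 x 32768 float32 weights, should use 4GB
        self.fc = nn.Linear(1024*32, 1024*32, bias=False)
    def forward(self, x):
        x = self.fc(x)
        return x

# make sure no memory is allocated
print(torch.cuda.memory_allocated()/1024**3)
# 0.0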

# initialize models
model_array = []
for i in range(3):
    model_obj = {}
    model = MyModel().cuda()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    model_obj['model'] = model
    model_obj['optimizer'] = optimizer
    model_array.append(model_obj)
    print("iter {}, memory allocated {}GB".format(i, torch.cuda.memory_allocated()/1024**3))
# iter 0, memory allocated 4.0GB
# iter 1, memory allocated 8.0GB
# iter 2, memory allocated 12.0GB
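
For small MLPs the picture in nvidia-smi can look different: the parameters of each model may only take a few KB or MB, while the CUDA context alone typically takes a few hundred MB and PyTorch's caching allocator reserves memory in blocks, so the number reported by nvidia-smi might barely move when more models are added. A rough sketch of how the ensemble could be checked directly instead is shown below. The helper names (param_bytes, check_ensemble), the tiny nn.Sequential model, and the random batches are assumptions made up for illustration, not part of the original code; the model_obj layout follows the snippets above.

```python
import torch
import torch.nn as nn

def param_bytes(model):
    # bytes occupied by the parameters of one model (excludes activations and optimizer state)
    return sum(p.numel() * p.element_size() for p in model.parameters())

def check_ensemble(model_array, batches, loss_fn):
    # hypothetical helper: runs one training step per model and reports, for each model,
    # its parameter size, whether it shares parameter memory with an earlier model,
    # and whether its weights actually changed after the optimizer step
    seen_ptrs = set()
    report = []
    for model_obj, (x, y) in zip(model_array, batches):
        model, optim = model_obj['model'], model_obj['optimizer']
        ptrs = {p.data_ptr() for p in model.parameters()}
        shared = bool(ptrs & seen_ptrs)
        seen_ptrs |= ptrs
        before = [p.detach().clone() for p in model.parameters()]
        optim.zero_grad()
        loss_fn(model(x), y).backward()
        optim.step()
        changed = any(not torch.equal(b, p.detach())
                      for b, p in zip(before, model.parameters()))
        report.append({'bytes': param_bytes(model), 'shared': shared, 'changed': changed})
    return report

# toy setup (assumption): 5 small MLPs, each with its own random batch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(0)
model_array, batches = [], []
for _ in range(5):
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    model_array.append({'model': model, 'optimizer': optimizer})
    batches.append((torch.randn(8, 16, device=device), torch.randn(8, 1, device=device)))

report = check_ensemble(model_array, batches, nn.MSELoss())
for idx, r in enumerate(report):
    print(idx, r)
```

If every entry reports shared as False and changed as True, each model owns its own parameters and is being updated by its own optimizer. On a GPU, printing torch.cuda.memory_allocated() after each model is created, as in the example above, should then show the per-model increase even when nvidia-smi does not.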