Linear CPU RAM usage increase with each minibatch?

Hi, I am new to PyTorch (and to ML and NNs in general), so I need some general guidance.
I have a basic question that I could not find a straight answer to anywhere.

In an ideal case, should CPU RAM usage be increasing with each mini-batch? To give numbers: the train_data size is ~6 million samples and the mini-batch size is 128.
RAM usage started at about 5% when the epoch began; I am now at the 23,000th training step and it has gone up to 60%. The increase seems linear in the number of mini-batches (about a 2.5% increase per 1,000 mini-batches).
However, once the epoch ends, RAM usage comes back down again. But why should it increase with each mini-batch? I don't see it.

For context, I am running training on a GPU machine and the flow is something like this:

model = mymodel(parameters)   # initialize NN model
model = model.to("cuda:0")
train_dataloader = CustomDataLoader(train_features, bs=128)

for epoch in range(n_epochs):
    steps = 0
    for X, y in train_dataloader:
        steps += 1
        if steps % 1000 == 0:
            gc.collect()   # I added this line to see if it makes a difference; it didn't.
        X.todevice("cuda:0")
        # X is actually a custom object which holds both tensor data and some metadata.
        # .todevice() is a method that sends the tensors to cuda but doesn't touch the
        # metadata. Can the problem lie here? It still does not explain why RAM usage
        # should increase with each mini-batch.
        y = y.to("cuda:0")
        yhat = model(X)
        criterion = torch.nn.BCEWithLogitsLoss(weight=weight).to("cuda:0")
        loss = criterion(yhat, y.to(torch.float32))
        train_losses.append(loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Please note the comment after X.todevice("cuda:0").
The dataloader is custom as well:

 
class CustomDataLoader():
    def __init__(self, ds, bs):
        self.ds, self.bs = ds, bs

    def __iter__(self):
        for i in range(0, len(self.ds), self.bs):
            # 'key' is one of the fields of the custom train_features object (details omitted)
            indices = self.ds[key].data[i:i + self.bs].clone()
            batch, y = self.ds.generate_batch(indices)
            yield batch, y

train_features is not a Dataset. It is a custom object (a UserDict) with a method "generate_batch" which, given indices, generates a "subset" of the same object. By subset I mean an instance of the same class with smaller data but the same metadata.
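To make that concrete, the object looks roughly like this. This is a simplified sketch, not my real code; the class and field names are made up, but it shows the idea behind generate_batch (and the todevice method used in the loop above):

from collections import UserDict

class FeatureSet(UserDict):              # made-up name, for illustration only
    def __init__(self, tensors, meta):
        super().__init__(tensors)        # dict of field name -> tensor, all the same length
        self.meta = meta                 # metadata shared by every subset

    def todevice(self, device):
        # moves the tensor fields to the GPU in place, leaves the metadata alone
        for k, v in self.data.items():
            self.data[k] = v.to(device)

    def generate_batch(self, indices):
        # a "subset": an instance of the same class with smaller data, same metadata
        sub = FeatureSet({k: v[indices] for k, v in self.data.items()}, self.meta)
        y = sub.data.pop("target")       # made-up label field
        return sub, y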

You may wonder why I am doing it this way. Well, if I try to make train_features a Dataset with a __getitem__ that produces single data objects (again, with the same metadata as train_features) and go the usual route, I have to write a collate function that takes these single data objects and "stacks" them up. That seems like an unnecessary process in which I first extract single data objects and then put them back together; I can cut out the middleman and produce batches directly from the object itself.
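For reference, the "usual route" I am trying to avoid would look roughly like this (a sketch only; SingleItemDataset and collate are hypothetical, not code I have actually written):

from torch.utils.data import Dataset, DataLoader

class SingleItemDataset(Dataset):
    def __init__(self, features):
        self.features = features           # the train_features object

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # extract a single-sample object with the same metadata as train_features
        return self.features.generate_batch([idx])

def collate(items):
    # the step I want to skip: take 128 single-sample objects and "stack" them
    # back into one batch object (concatenate the tensors, keep the metadata once)
    raise NotImplementedError

loader = DataLoader(SingleItemDataset(train_features), batch_size=128, collate_fn=collate)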

Does PyTorch's built-in DataLoader have something that I will miss by doing it this way?

Thank you for taking the time to read my question; I really appreciate your guidance. Let me know if you need further details.

Update: I just iterated over train_dataloader (and sent things to cuda) without performing any operations on them, and RAM usage did not increase. So it is not the dataloader; it is probably the model/training loop.
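In case it helps anyone else, this is the isolation test I ran (iteration and transfer only, no forward or backward pass):

for X, y in train_dataloader:
    X.todevice("cuda:0")
    y = y.to("cuda:0")
    # no model call, no loss, no backward -- RAM stays flat here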

Ah, I just understood what is happening. I am storing loss in a list, and each stored loss tensor keeps its whole backward graph alive, so memory grows with every mini-batch. I need to detach it before storing.
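In other words, the fix is to store a plain number (or a detached tensor) instead of the graph-attached loss:

# store a Python float instead of the loss tensor, so the autograd graph can be freed
train_losses.append(loss.item())
# or, to keep a tensor: train_losses.append(loss.detach().cpu())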
Thank you everyone, I will close this silly question. :slight_smile:
