Linear CPU RAM usage increase with each minibatch?

Hi, I am new to PyTorch (and to ML and NNs in general), so I need some general guidance.
I have a basic question that I could not find a straight answer to anywhere.

In an ideal case, should CPU RAM usage be increasing with each mini-batch? To give numbers: the train_data size is ~6 million samples and the mini-batch size is 128.
RAM usage started at about 5% when the epoch began; I am now at the 23,000th training step and it has gone up to 60%. The increase seems linear in the number of mini-batches (about a 2.5% increase per 1,000 mini-batches).
However, once the epoch ends, RAM usage comes back down again. But why should it increase with each mini-batch? I don't see it.

For context, I am running training on a GPU machine and the flow is something like this:

model = mymodel(parameters)   # initialize NN model
model = model.to("cuda:0")
train_dataloader = CustomDataLoader(train_features, bs=128)

for epoch in range(n_epochs):
    steps = 0
    for X, y in train_dataloader:
        steps += 1
        if steps % 1000 == 0:
            gc.collect()   # I added this line to see if it makes a difference; it didn't.
        X.todevice("cuda:0")
        # X is actually a custom object which holds both tensor data and some metadata.
        # .todevice() is a method that sends the tensors to cuda but doesn't touch the
        # metadata. Can the problem lie here? It still does not explain why RAM usage
        # should increase with each mini-batch.
        y = y.to("cuda:0")
        yhat = model(X)
        criterion = torch.nn.BCEWithLogitsLoss(weight=weight).to("cuda:0")
        loss = criterion(yhat, y.to(torch.float32))
        train_losses.append(loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Please note the comment after X.todevice("cuda:0").
The dataloader is custom as well:

 
class CustomDataLoader():
    def __init__(self, ds, bs):
        self.ds, self.bs = ds, bs

    def __iter__(self):
        for i in range(0, len(self.ds), self.bs):
            # 'key' is one of the fields of the custom train_features object (details omitted)
            indices = self.ds[key].data[i:i + self.bs].clone()
            batch, y = self.ds.generate_batch(indices)
            yield batch, y

train_features is not a Dataset. It is a custom object (a UserDict) with a method "generate_batch" which, given indices, generates a "subset" of the same object. By subset I mean an instance of the same class with smaller data but the same metadata.
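To make that concrete, the object looks roughly like this. This is a simplified sketch, not my real code; the class and field names are made up, but it shows the idea behind generate_batch (and the todevice method used in the loop above):

from collections import UserDict

class FeatureSet(UserDict):              # made-up name, for illustration only
    def __init__(self, tensors, meta):
        super().__init__(tensors)        # dict of field name -> tensor, all the same length
        self.meta = meta                 # metadata shared by every subset

    def todevice(self, device):
        # moves the tensor fields to the GPU in place, leaves the metadata alone
        for k, v in self.data.items():
            self.data[k] = v.to(device)

    def generate_batch(self, indices):
        # a "subset": an instance of the same class with smaller data, same metadata
        sub = FeatureSet({k: v[indices] for k, v in self.data.items()}, self.meta)
        y = sub.data.pop("target")       # made-up label field
        return sub, y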

You may wonder why I am doing it this way. Well, if I try to make train_features a Dataset with a __getitem__ that produces single data objects (again, with the same metadata as train_features) and go the usual route, I have to write a collate function that takes these single data objects and "stacks" them up. That seems like an unnecessary process in which I first extract single data objects and then put them back together; I can cut out the middleman and produce batches directly from the object itself.
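For reference, the "usual route" I am trying to avoid would look roughly like this (a sketch only; SingleItemDataset and collate are hypothetical, not code I have actually written):

from torch.utils.data import Dataset, DataLoader

class SingleItemDataset(Dataset):
    def __init__(self, features):
        self.features = features           # the train_features object

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # extract a single-sample object with the same metadata as train_features
        return self.features.generate_batch([idx])

def collate(items):
    # the step I want to skip: take 128 single-sample objects and "stack" them
    # back into one batch object (concatenate the tensors, keep the metadata once)
    raise NotImplementedError

loader = DataLoader(SingleItemDataset(train_features), batch_size=128, collate_fn=collate)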

Does PyTorch's built-in DataLoader have something that I will miss by doing it this way?

Thank you for taking the time to read my question; I really appreciate your guidance. Let me know if you need further details.

Update: I just iterated over train_dataloader (and sent things to cuda) without performing any operations on them, and RAM usage did not increase. So it is not the dataloader; it is probably the model/training loop.
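In case it helps anyone else, this is the isolation test I ran (iteration and transfer only, no forward or backward pass):

for X, y in train_dataloader:
    X.todevice("cuda:0")
    y = y.to("cuda:0")
    # no model call, no loss, no backward -- RAM stays flat here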

Ah, I just understood what is happening. I am storing loss in a list, and each stored loss tensor keeps its whole backward graph alive, so memory grows with every mini-batch. I need to detach it before storing.
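In other words, the fix is to store a plain number (or a detached tensor) instead of the graph-attached loss:

# store a Python float instead of the loss tensor, so the autograd graph can be freed
train_losses.append(loss.item())
# or, to keep a tensor: train_losses.append(loss.detach().cpu())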
Thank you everyone, I will close this silly question. :slight_smile:
