Memory Leak with Linear and ReLU Layers

I’m experiencing a memory leak when using a Linear layer followed by a ReLU layer.

def forward(self, noise, elem_class):
    
    process = psutil.Process(os.getpid())
    print("Memory 1.1: ", process.memory_info().rss, "bytes", flush=True)
    
    in_vector = torch.cat((noise, elem_class), dim=-1) 
    
    process = psutil.Process(os.getpid())
    print("Memory 1.2: ", process.memory_info().rss, "bytes", flush=True)
    
    in_vector = self.shared_gen_linear(in_vector)
    
    process = psutil.Process(os.getpid())
    print("Memory 1.3: ", process.memory_info().rss, "bytes", flush=True)
    
    in_vector = F.relu(in_vector)           #   batch_size x (2*standard_dim)
    
    process = psutil.Process(os.getpid())
    print("Memory 1.4: ", process.memory_info().rss, "bytes", flush=True)
    
    # split the shared representation into the two output branches
    feats = in_vector[:, :self.standard_dim]
    feats2 = in_vector[:, self.standard_dim:]
    
    for layer in self.feat_gen_linear:
        feats = layer(feats)
    
    # softmax over the class dimension for every feature
    feats = feats.view(-1, self.feature_size, self.num_classes)
    feats = torch.softmax(feats, dim=-1)
    feats = feats.view(-1, self.num_classes*self.feature_size)

    feats2 = self.feats2_gen_linear(feats2)
    feats2 = feats2.view(-1, self.num_classes, 3)
    feats2 = torch.softmax(feats2, dim=-1)
    feats2 = feats2.view(-1, self.num_classes*3)
    
    return feats, feats2

This layer is defined in the __init__ method of the class:

self.shared_gen_linear = nn.Linear(self.noise_dim + self.num_classes, 2*self.standard_dim)

From the printed process memory, the leak appears to occur in these two lines:

    in_vector = self.shared_gen_linear(in_vector)   
    in_vector = F.relu(in_vector)           #   batch_size x (2*standard_dim)

As seen here:

Memory 1.1: 1217851392 bytes
Memory 1.2: 1217851392 bytes
Memory 1.3: 1218121728 bytes
Memory 1.4: 1218392064 bytes
Memory 1.1: 1219203072 bytes
Memory 1.2: 1219203072 bytes
Memory 1.3: 1219473408 bytes
Memory 1.4: 1220014080 bytes

Any insights on why this memory leak would occur? I am using PyTorch 1.1.0.

Could you post an executable code snippet to reproduce this issue?
Using your forward method with the undefined parts removed does not result in a memory leak:

import os
import psutil
import torch
import torch.nn as nn
import torch.nn.functional as F


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.shared_gen_linear = nn.Linear(10, 10)
        
    def forward(self, noise, elem_class):
        
        process = psutil.Process(os.getpid())
        print("Memory 1.1: ", process.memory_info().rss, "bytes", flush=True)
        
        in_vector = torch.cat((noise, elem_class), dim=-1) 
        
        process = psutil.Process(os.getpid())
        print("Memory 1.2: ", process.memory_info().rss, "bytes", flush=True)
        
        in_vector = self.shared_gen_linear(in_vector)
        
        process = psutil.Process(os.getpid())
        print("Memory 1.3: ", process.memory_info().rss, "bytes", flush=True)
        
        in_vector = F.relu(in_vector)           #   batch_size x (2*standard_dim)
        
        process = psutil.Process(os.getpid())
        print("Memory 1.4: ", process.memory_info().rss, "bytes", flush=True)
        
        feats = in_vector[:, :5]
        feats2 = in_vector[:, 5:]
        
        process = psutil.Process(os.getpid())
        print("Memory 1.4: ", process.memory_info().rss, "bytes", flush=True)
        
        return feats, feats2


model = MyModel()
model(torch.randn(1, 5), torch.randn(1, 5))

Memory 1.1:  247803904 bytes
Memory 1.2:  247803904 bytes
Memory 1.3:  247803904 bytes
Memory 1.4:  247803904 bytes
Memory 1.4:  247803904 bytes
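
If a minimal snippet like the one above doesn't reproduce it, one way to narrow down where tensors accumulate in the full training script is to count the live tensors between iterations. This is only a rough diagnostic sketch (not taken from your code), using the standard gc module:

import gc
import torch

def count_live_tensors():
    # count the tensors currently tracked by the Python garbage collector
    n = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                n += 1
        except Exception:
            pass
    return n

# call at the end of every training iteration; a steadily growing count
# points to tensors (and their computation graphs) being kept alive somewhere
print("live tensors:", count_live_tensors())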

I found the problem while putting together an executable example.

The problem is actually not related to those two operations at all, even though the memory appears to accumulate right after them.

In my original training code I was accumulating the loss tensors themselves instead of adding their values, which keeps each iteration's computation graph alive and steadily increases memory usage. Specifically, I had

total_gen_loss += gen_loss

instead of

total_gen_loss += gen_loss.item()

where gen_loss is a torch.Tensor returned by the loss computation (e.g., gen_loss = tensor(2.9921, grad_fn=<...>)).

This also explains why the memory appeared to grow during the Linear and ReLU operations: that is where the activations are allocated, and the accumulated loss tensors kept each iteration's computation graph (including those activations) alive. Switching to gen_loss.item() stopped the memory growth; a minimal sketch of the two patterns is below.
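
For anyone else hitting the same symptom, here is a minimal sketch of the difference, with a placeholder model, optimizer, and loss (none of these names come from my actual code):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
total_gen_loss = 0.0

for step in range(100):
    x = torch.randn(32, 10)
    gen_loss = model(x).pow(2).mean()           # placeholder loss

    opt.zero_grad()
    gen_loss.backward()
    opt.step()

    # Leaks: keeps gen_loss and its whole computation graph alive across iterations.
    # total_gen_loss += gen_loss

    # Fixed: stores only the Python float, so the graph can be freed.
    total_gen_loss += gen_loss.item()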

Thank you for your help!
