Adam optimizer.step CUDA OOM

What I know about the problem

  1. Adam is stateful and requires memory roughly proportional to the number of parameters in the model (rough estimate sketched just below this list).
  2. Model parameters must be loaded onto device 0.
  3. The OOM occurs at state['exp_avg_sq'] = torch.zeros_like(p.data), which appears to be the last memory allocation in the optimizer source code.
  4. Neither manual allocation nor nn.DataParallel prevents the OOM error.
  5. I moved the loss calculation into the forward method to reduce memory use during training.
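
As a rough sanity check, this is how I estimate that extra memory (a sketch, assuming fp32 parameters and counting only the exp_avg and exp_avg_sq buffers):

def adam_state_bytes(model):
    # Adam allocates exp_avg and exp_avg_sq with torch.zeros_like(p),
    # i.e. one pair of buffers per parameter, same dtype and same device as p
    return 2 * sum(p.numel() * p.element_size() for p in model.parameters())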

Below are my training and forward methods

def train(dataloader, vocabulary_dict, epoch):
   
    model = ViralClassification(len(vocabulary_dict), 0.5, 6588 )#, device_ids=[0,1], output_device=1)
    model.to('cuda:0')
    model.print_gpu_memory_info()
    
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    
    cudnn.benchmark = True
    cudnn.enabled = True


    for i in range(epoch):

        running_loss = 0.0

        for batch_idx, (label, sequence) in enumerate(dataloader):
            
            loss = model(sequence.to('cuda:0'), label.to('cuda:1'))#.to('cuda:0'))
            running_loss += loss.item()            
            loss.backward()
            del loss            
           
            print('gpu_memory_one in training_loop') 
            model.print_gpu_memory_info()
            torch.cuda.empty_cache()

            print('gpu_memory_two in training_loop')
            model.print_gpu_memory_info()
            
            print('bout to step')
            optimizer.step()
            
            
            optimizer.zero_grad() 
            print('bottom of training loop')
def forward(self, inputs, labels):

        inputs = self.embedding(inputs)#.to('cuda:0')
        #print('embedded completed')

        (inputs, hidden_state) = self.bilstm_layer(inputs)
        #print('bilstm completed')
        inputs.to('cuda:1')
        
        self.bilstm_layer.flatten_parameters()
        
        torch.cuda.synchronize('cuda:0')
        torch.cuda.synchronize('cuda:1') 
        
        inputs = self.attention_layer(inputs)
        #print('attention completed')

        inputs = inputs.view(-1, self.row*2*self.lstm_dim)#.to('cuda:1')
        #print('view transformation completed')
        
        inputs = self.mlp_one(inputs, self.relu_one)
        #print('mlp one completed')

        inputs = self.mlp_two(inputs, self.relu_two)
        #print('mlp two completed')

        torch.cuda.synchronize('cuda:0')
        torch.cuda.synchronize('cuda:1')
        
        logits = self._classify(inputs)
        #print('logits completed')

        torch.cuda.empty_cache()
        torch.cuda.synchronize('cuda:0')
        torch.cuda.synchronize('cuda:1')
        #self.print_gpu_memory_info() 

        loss = self.criterion(logits, labels)
        
        return loss

The OOM occurs when I call optimizer.step().

My problem is that before optimizer.step() there is plenty of free memory on device 1, but since the optimizer performs its calculations on device 0, the OOM occurs there.

Is this a problem that checkpointing may be able to solve?
Is it possible to change the location of the optimizer?

I’m not sure how your model works.
Your forward method seems to use model sharding, i.e. different parts of the model are located on different devices.
However, you are never transferring the inputs to GPU1, since you are not reassigning it in this line of code:

inputs.to('cuda:1')
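
i.e. the result has to be assigned back for the transfer to take effect:

inputs = inputs.to('cuda:1')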

Also, based on your code you are only calling model.to('cuda:0'), which would mean everything runs on GPU0.
However, it seems that at least your labels are on GPU1, which should create an error in your loss calculation.

Are code parts missing or am I missing something? 🙂

I don’t think you missed anything. This has given me a way forward to solving my problem. Thanks a lot for the help.

So the reassignment of inputs acts like a pointer to the device memory space containing the new tensor?

model.to('cuda:0') will override explicit statements within the model itself?

Also, the reason I am passing my labels to device 1 is that, in theory (if I had reassigned), the logits should be output on device 1. I did this because I thought it would reduce memory use on device 0, allowing more room for the optimizer.

The model is for genomic sequence data and can be found at https://www.biorxiv.org/content/10.1101/694851v1, if you’re interested.

Edit: I think I pinpointed the error: when the optimizer attempts to step over the embedding layer, it exceeds memory because of the size of my vocabulary. When I move the embedding to device 1, I get an OOM on device 1 while the optimizer params stay on device 0. If I move the embedding to the CPU, the model trains.
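
Roughly what that workaround looks like (a minimal sketch with a placeholder linear head standing in for the rest of my model, not the actual code):

import torch
import torch.nn as nn
import torch.optim as optim

vocab_size, emb_dim = 8390658, 100
embedding = nn.Embedding(vocab_size, emb_dim)        # stays on the CPU
gpu_head = nn.Linear(emb_dim, 2).to('cuda:0')        # placeholder for the GPU part of the model

optimizer = optim.Adam(list(embedding.parameters()) + list(gpu_head.parameters()), lr=0.01)

tokens = torch.randint(0, vocab_size, (4, 10))       # index tensor stays on the CPU
features = embedding(tokens).to('cuda:0')            # only the activations are moved to the GPU
loss = gpu_head(features).mean()
loss.backward()
optimizer.step()   # Adam creates its state per parameter, on that parameter's device

Since Adam builds its state with torch.zeros_like(p), the exp_avg/exp_avg_sq for the embedding weight now live in host memory instead of on device 0.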

Tensor.to() returns a copy on the specified device (which is why you have to reassign it), while model.to() works recursively on all submodules and moves their parameters and buffers in place.
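
A small sketch to show the difference (toy tensors, not your model):

import torch
import torch.nn as nn

x = torch.randn(2, 2)
y = x.to('cuda:0')
print(x.device, y.device)            # cpu cuda:0 -> the original tensor is untouched

m = nn.Linear(2, 2)
m.to('cuda:0')                       # modules are moved in place, recursively
print(next(m.parameters()).device)   # cuda:0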

How large is your embedding layer?

Vocabulary dimension is 8,390,658.
Embedding dimension is 100.

The embedding is supposed to represent all 12-character strings over the nucleotide alphabet together with their reverse complements.
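
Back of the envelope (assuming fp32 weights and a dense gradient), that layer alone accounts for most of the memory:

vocab, dim = 8390658, 100
weight_gb = vocab * dim * 4 / 1024**3    # ~3.1 GB for the embedding weight itself
grad_gb = weight_gb                      # plus a dense gradient of the same size
adam_gb = 2 * weight_gb                  # plus exp_avg + exp_avg_sq, ~6.3 GB more
print(round(weight_gb, 1), round(grad_gb, 1), round(adam_gb, 1))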