Help: CUDA error: out of memory

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I’m getting this error message when I try to load a PyTorch model in a Flask application.


The error is raised if you are running out of memory on your device, so you could try to reduce the memory requirement, e.g. by lowering the batch size (if possible).
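
For reference, a minimal way to check how much device memory is actually free before loading the model (torch.cuda.mem_get_info returns the free and total memory in bytes for the current device):

import torch

free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1024**3:.2f} GiB / total: {total / 1024**3:.2f} GiB")

# what this process itself has allocated/reserved through the caching allocator
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")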

Hi, I got the same error when calling x.to("cuda:3"), where x = torch.randn(1, 1).

= = = = = = = = = = = = = = = = = =
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
= = = = = = = = = = = = = = = = = =

However, I was able to call x.to("cuda:2"). Running nvidia-smi gives:

+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208…    Off  | 00000000:67:00.0 Off |                  N/A |
| 27%   32C    P8     1W / 250W |      4MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208…    Off  | 00000000:68:00.0 Off |                  N/A |
| 28%   38C    P8    28W / 250W |     21MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I suspected the problem was related to setting non_blocking=True when using some_tensor.to(device, non_blocking=True). I had been successfully running my actual code all day, which uses 4 GPUs (cuda:0-cuda:3), but at some point I started getting the above error. If I recall correctly, it might have started after I killed one of two "python main.py" processes.

This sounds rather like a setup issue, unrelated to the usage of non_blocking=True.
Check if some dead processes are still using the device. In case the CUDA context is corrupt in the current session, start a new Python process. However, since the error seems to have popped up suddenly, you might also check whether a restart of your workstation helps.

Hey,

Try to kill the stale process. To do that:

  1. Run nvidia-smi.
  2. In the lower table you will see the processes running on your GPUs.
  3. Check their PIDs.
  4. Kill those processes with kill PID_NUMBER (see the sketch below).
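
If you prefer doing this from Python, a minimal sketch (assuming nvidia-smi is on the PATH):

import subprocess

# list PID and memory usage of every compute process currently on the GPUs
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)  # a stale PID from this list can then be stopped with kill <PID>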

Hope it helps


Hi @ptrblck, I hope you are well. Sorry, I have resolved a heap of errors from the code below and am now stuck at this one. I tried a batch_size as low as 4, but I still get this error. Do you have any idea? Is something wrong with the code structure? Did I miss anything? Many, many thanks for your help.

#!/usr/bin/env python
# coding: utf-8

# In[1]:


from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
from transformers import TextDataset, DataCollatorForLanguageModeling
#from transformers import AutoModelWithLMHead
from transformers import AutoModelForCausalLM
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm, trange
import numpy as np
import pandas as pd
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import datetime
import gc
import math
import os
import random
import sys
import time
## the directory includes the package from NVIDIA
#sys.path.append('/home/momenisa/GPU_ZIP_Apex/apex-master/apex/')
#from apex import amp
#from apex.parallel import DistributedDataParallel as DDP
######################
weight_decay=0
learning_rate=5e-5
adam_epsilon=1e-8
warmup_steps = 1e2
lr=5e-5
Max_length=400

PathData='/home//NLP_Projects/CaseSummary_resolutionProject/Results_GPT_2/model_v4200_k_bs=16_lr=5e-05_epochs=20/'
pretrained_model='/home///GPT_2/'


########################################
def format_time(elapsed):
    # elapsed seconds -> h:mm:ss string (used for the epoch timing below)
    return str(datetime.timedelta(seconds=int(round(elapsed))))

################################################
class GPT2Dataset(Dataset):

    def __init__(self, txt_list, tokenizer, gpt2_type=pretrained_model, max_length=400):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer('<|startoftext|>'+ txt + '<|endoftext|>', truncation=True, max_length=max_length, padding="max_length")

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx] 

######################################################
def ddp_setup(rank, world_size):
    """
    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
#########################################################

def main(rank: int, world_size: int, save_every: int, total_epochs: int, batch_size: int):
    
    gpu_id=rank
    
    Path='/home//NLP_Projects/CaseSummary_resolutionProject/Results_GPT_2/multipleGPU/model_v4'\
    'data_'+str(200)+'_k'+'_'+'bs='+str(batch_size)+'_lr='+str(learning_rate)+'_epochs='+str(total_epochs)
    
    print(Path)
    
    Results_Path=Path+'/Results/'
    os.makedirs(Results_Path, exist_ok=True)

    print(Results_Path)
        
    print(PathData)
    
    print(rank)

    ### defined variable ###############
    seed_val = 42
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

    ###############################
    
    ddp_setup(rank, world_size)
    
    ###############################
    
    tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model, bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>') #gpt2-small

    model = GPT2LMHeadModel.from_pretrained(pretrained_model)

    model.resize_token_embeddings(len(tokenizer))

    ## load the train and validation datasets
    print(PathData)
    trains_titles=pd.read_csv(PathData+'/'+'traindata.csv')
    valid_titles=pd.read_csv(PathData+'/'+'validdata.csv')
    
    trains_titles=trains_titles.drop(columns=['Unnamed: 0'])['0']
    valid_titles=valid_titles.drop(columns=['Unnamed: 0'])['0']

    train_dataset = GPT2Dataset(trains_titles, tokenizer, max_length=Max_length)

    Val_dataset = GPT2Dataset(valid_titles, tokenizer, max_length=Max_length)
    
    ############################################################################
    
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                                       batch_size=batch_size,
                                                        pin_memory=True,
                                                        shuffle=False,
                                                       sampler=DistributedSampler(train_dataset))

    validation_loader= torch.utils.data.DataLoader(dataset=Val_dataset,
                                                       batch_size=batch_size,
                                                        pin_memory=True,
                                                        shuffle=False,
                                                       sampler=DistributedSampler(Val_dataset))
    
    
    total_steps = len(train_loader) * total_epochs


    ################# define optimizer and scheduler#########################

    # Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
    optimizer = AdamW(model.parameters(), lr = learning_rate,eps = adam_epsilon)


    # Create the learning rate scheduler.
    # This changes the learning rate as the training loop progresses
    scheduler = get_linear_schedule_with_warmup(optimizer, 
                                                num_warmup_steps = warmup_steps, 
                                                num_training_steps = total_steps)

    ############################## train_loader and validation_loader ######################
 
    training_steps_per_epoch=len(train_loader)
    total_num_training_steps = int(training_steps_per_epoch*total_epochs)

  ######################## applying DDP on the model for training ############################
    model=model.to(gpu_id)
    model = DDP(model, device_ids=[gpu_id])
    print("gpu_id",gpu_id)
    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, total_epochs))
    print('Training...')
        
    training_stats = []
           

    for epoch_i in range(0, total_epochs):

        ##########################################
        train_loader.sampler.set_epoch(epoch_i)
        b_sz = len(next(iter(train_loader))[0])
        print(f"[GPU{gpu_id}] Epoch {epoch_i} | Batchsize: {b_sz} | Steps: {len(train_loader)}")
        train_loader.sampler.set_epoch(epoch_i)
        ##########################################

        t0 = time.time()

        total_train_loss = 0

        model.train()

        for step, batch in enumerate(train_loader):

            #################################
            b_input_ids = batch[0].to(gpu_id)
            b_labels = batch[0].to(gpu_id)
            b_masks = batch[1].to(gpu_id)
            #################################

            optimizer.zero_grad()        

            outputs = model(  b_input_ids,
                             labels=b_labels, 
                              attention_mask = b_masks,
                              token_type_ids=None
                            )

            loss = outputs[0]  
            batch_loss = loss.item()
            total_train_loss += batch_loss
            loss.backward()
            optimizer.step()
            scheduler.step()

        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_loader)

        del total_train_loss
        del batch_loss
        
        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)
        # ========================================
        #               Validation
        # ========================================

        print("")
        print("Running Validation...")

        avg_val_loss_1=[]
        t0 = time.time()
        #################### is this section correct for validation? #############
        model.eval()
        # note: the model is already wrapped in DDP above, so it is not wrapped again here
        ########################################
        total_eval_loss = 0
        nb_eval_steps = 0
        
        ########################################
        validation_loader.sampler.set_epoch(epoch_i)
        b_sz = len(next(iter(validation_loader))[0])
        print(f"[GPU{gpu_id}] Epoch {epoch_i} | Batchsize: {b_sz} | Steps: {len(validation_loader)}")
        ###########################################
        
        # Evaluate data for one epoch
        for batch in validation_loader:

            b_input_ids = batch[0].to(gpu_id)
            b_labels = batch[0].to(gpu_id)
            b_masks = batch[1].to(gpu_id)

            with torch.no_grad():        
                outputs  = model(b_input_ids,attention_mask = b_masks,labels=b_labels)
                loss = outputs[0]  
            batch_loss = loss.item()
            total_eval_loss += batch_loss        

        avg_val_loss = total_eval_loss / len(validation_loader)

        perplexity=math.exp(avg_val_loss)

        avg_val_loss_1.append(avg_val_loss)

        validation_time = format_time(time.time() - t0)    

        del total_eval_loss 


        print("  Validation Loss: {0:.2f}".format(avg_val_loss))
        print("  Validation took: {:}".format(validation_time))

        # Record all statistics from this epoch.
        training_stats.append(
            {
                'epoch': epoch_i + 1,
                'Training Loss': avg_train_loss,
                'Valid. Loss': avg_val_loss,
                'Training Time': training_time,
                'Validation Time': validation_time,
                'perplexity': perplexity
            }
        )
        gc.collect()
        
        ################### saving the model ########################

        if gpu_id == 0 and epoch_i % save_every == 0:

            Pathmodel=Results_Path+'/'+'savemodel_epoch='+str(epoch_i)
            os.makedirs(Pathmodel, exist_ok=True)

            # save the underlying model's weights from inside the DDP wrapper
            ckp = model.module.state_dict()
            torch.save(ckp, Pathmodel+'/checkpoint.pt')
    ############ save the results #####################
    Path_2=Results_Path+'/'+'training_stats='+str(0)+".csv"
    torch.save(training_stats,Path_2)

    #### is this a good place to call destroy_process_group? ###########
    destroy_process_group()
    #############################
if __name__ == '__main__':
    import sys
    total_epochs=21
    save_every=5
    batch_size=4
    world_size = torch.cuda.device_count()
    print(world_size)
    mp.spawn(main, args=(world_size, save_every, total_epochs, batch_size), nprocs=world_size,join=True)

I don’t see any obvious mistakes in your code, and the OOM could be expected depending on the used model and GPU. Did you estimate the memory requirements, or why do you think a larger batch size should fit?

Many thanks for your reply. I ran the model once with batch_size = 16 on one GPU and it worked. How can I estimate the needed memory?
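
One rough way to measure it is to run a single training step at the target batch size and read PyTorch's peak-memory counter afterwards (a minimal sketch reusing the tensor names from the training loop above):

import torch

torch.cuda.reset_peak_memory_stats()

# one forward/backward/optimizer step with the desired batch size
outputs = model(b_input_ids, labels=b_labels, attention_mask=b_masks)
outputs[0].backward()
optimizer.step()

peak = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak memory used by this step: {peak:.2f} GiB")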

@ptrblck, this is my final code. Training works; my concern is the validation part. I decided to run validation on one GPU because I have no view of what happens when we use DDP for the validation part. As I understand it, by default rank = 0 is the master GPU. I sorted the code out as shown here and use model.module for evaluation, as people recommend from what I have searched. Does the code look meaningful to you? My idea is that the final model lives on rank = 0, so that for each epoch I can run the validation part and save the model. Is that OK?

#!/usr/bin/env python
# coding: utf-8

# In[1]:


from torch.utils.data import Dataset, DataLoader, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
from transformers import TextDataset, DataCollatorForLanguageModeling
#from transformers import AutoModelWithLMHead
from transformers import AutoModelForCausalLM
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm, trange
import numpy as np
import pandas as pd
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import datetime
import gc
import math
import os
import random
import sys
import time
######################
weight_decay=0
learning_rate=5e-5
adam_epsilon=1e-8
warmup_steps = 1e2
lr=5e-5
Max_length=400

PathData='/home//NLP_Projects/CaseSummary_resolutionProject/Results_GPT_2/model_v4200_k_bs=16_lr=5e-05_epochs=20/'
pretrained_model='/home/momenisa//GPT_2/'

########################################
def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

################################################
class GPT2Dataset(Dataset):

    def __init__(self, txt_list, tokenizer, gpt2_type=pretrained_model, max_length=400):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer('<|startoftext|>'+ txt + '<|endoftext|>', truncation=True, max_length=max_length, padding="max_length")

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx] 

######################################################
def ddp_setup(rank, world_size):
    """
    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
#########################################################

def main(rank: int, world_size: int, save_every: int, total_epochs: int, batch_size: int):
    
    gpu_id=rank

    # output directory (set up as in the previous version of the script)
    Path='/home//NLP_Projects/CaseSummary_resolutionProject/Results_GPT_2/multipleGPU/model_v4'\
    'data_'+str(200)+'_k'+'_'+'bs='+str(batch_size)+'_lr='+str(learning_rate)+'_epochs='+str(total_epochs)
    Results_Path=Path+'/Results/'
    os.makedirs(Results_Path, exist_ok=True)

    print(Results_Path)
    print(PathData)
    print(rank)

    ### defined variable ###############
    seed_val = 42
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

    ###############################
    
    ddp_setup(rank, world_size)
    
    ###############################
    
    tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model, bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>') #gpt2-small

    model = GPT2LMHeadModel.from_pretrained(pretrained_model)

    model.resize_token_embeddings(len(tokenizer))

    ## load the train and validation datasets
    print(PathData)
    trains_titles=pd.read_csv(PathData+'/'+'traindata.csv')
    valid_titles=pd.read_csv(PathData+'/'+'validdata.csv')
    
    trains_titles=trains_titles.drop(columns=['Unnamed: 0'])['0']
    valid_titles=valid_titles.drop(columns=['Unnamed: 0'])['0']

    print(trains_titles.head(2))
    
    train_dataset = GPT2Dataset(trains_titles, tokenizer, max_length=Max_length)

    Val_dataset = GPT2Dataset(valid_titles, tokenizer, max_length=Max_length)
    
    ############################################################################
    
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                                       batch_size=batch_size,
                                                        pin_memory=True,
                                                        shuffle=False,
                                                       sampler=DistributedSampler(train_dataset))

    # For validation the order doesn't matter, so we'll just read the samples sequentially.
    validation_loader = DataLoader(
            Val_dataset, # The validation samples.
            sampler = SequentialSampler(Val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )
    
    
    total_steps = len(train_loader) * total_epochs


    ################# define optimizer and scheduler#########################

    # Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
    optimizer = AdamW(model.parameters(), lr = learning_rate,eps = adam_epsilon)


    # Create the learning rate scheduler.
    # This changes the learning rate as the training loop progresses
    scheduler = get_linear_schedule_with_warmup(optimizer, 
                                                num_warmup_steps = warmup_steps, 
                                                num_training_steps = total_steps)

    ############################## train_loader and validation_loader ######################
 
    training_steps_per_epoch=len(train_loader)
    total_num_training_steps = int(training_steps_per_epoch*total_epochs)

  ######################## applying DDP on the model for training ############################
    model=model.to(gpu_id)
    model = DDP(model, device_ids=[gpu_id])
    print("gpu_id",gpu_id)
    # ========================================
    #               Training
    # ========================================


        
    training_stats = []
           

    for epoch_i in range(0, total_epochs):
        
        print("")
        print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, total_epochs))
        print('Training...')

        ##########################################
        train_loader.sampler.set_epoch(epoch_i)
        b_sz = len(next(iter(train_loader))[0])
        print(f"[GPU{gpu_id}] Epoch {epoch_i} | Batchsize: {b_sz} | Steps: {len(train_loader)}")
        ##########################################

        t0 = time.time()

        total_train_loss = 0

        model.train()

        for step, batch in enumerate(train_loader):

            #################################
            b_input_ids = batch[0].to(gpu_id,non_blocking=True)
            b_labels = batch[0].to(gpu_id,non_blocking=True)
            b_masks = batch[1].to(gpu_id,non_blocking=True)
            #################################

            optimizer.zero_grad()        

            outputs = model(  b_input_ids,
                             labels=b_labels, 
                              attention_mask = b_masks,
                              token_type_ids=None
                            )

            loss = outputs[0]  
            batch_loss = loss.item()
            total_train_loss += batch_loss
            loss.backward()
            optimizer.step()
            scheduler.step()

        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_loader)  

        del total_train_loss
        del batch_loss
        
        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)
        # ========================================
        #               Validation
        # ========================================

        print("")
        print("Running Validation...")

        avg_val_loss_1=[]
        t0 = time.time()
        #################### is this section correct for validation? #############
        if gpu_id == 0:

            model.eval()
            ########################################
            total_eval_loss = 0
            nb_eval_steps = 0

            # Evaluate data for one epoch
            for batch in validation_loader:

                b_input_ids = batch[0].to(gpu_id)
                b_labels = batch[0].to(gpu_id)
                b_masks = batch[1].to(gpu_id)

                with torch.no_grad():        
                    outputs  = model.module(b_input_ids,attention_mask = b_masks,labels=b_labels)
                    loss = outputs[0]  
                batch_loss = loss.item()
                total_eval_loss += batch_loss        

            avg_val_loss = total_eval_loss / len(validation_loader)

            perplexity=math.exp(avg_val_loss)

            avg_val_loss_1.append(avg_val_loss)

            validation_time = format_time(time.time() - t0)    

            del total_eval_loss 


            print("  Validation Loss: {0:.2f}".format(avg_val_loss))
            print("  Validation took: {:}".format(validation_time))

            # Record all statistics from this epoch.
            training_stats.append(
                {
                    'epoch': epoch_i + 1,
                    'Training Loss': avg_train_loss,
                    'Valid. Loss': avg_val_loss,
                    'Training Time': training_time,
                    'Validation Time': validation_time,
                    'perplexity': perplexity
                }
            )
            gc.collect()
        
        ################### saving the model ########################

            Path2=Results_Path+'/'+'savemodel_epoch='+str(epoch_i)
    
            os.makedirs(Path2, exist_ok=True)

            ckp = model.module.state_dict()
            torch.save(ckp, Path2+'/checkpoint.pt')
    ############ save the results #####################
    Path_2=Results_Path+'/'+'training_stats='+str(0)+".csv"
    torch.save(training_stats,Path_2)

    #### is this a good place to call destroy_process_group? ###########
    destroy_process_group()
    #############################
if __name__ == '__main__':
    import sys
    total_epochs=int(sys.argv[1])
    save_every=int(sys.argv[2])
    batch_size=int(sys.argv[3])
    world_size = (torch.cuda.device_count())-1
    print(world_size)
    mp.spawn(main, args=(world_size, save_every, total_epochs, batch_size), nprocs=world_size,join=True)

You don’t need to use a single GPU for the validation run and could take a look at the ImageNet example to see how the validation dataset is processed in a multi-GPU setup. Also, since apparently a larger batch size was working before, I would recommend trying to reproduce this setup to see what is using more memory now.
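
For reference, a minimal sketch of a DDP validation loop in which every rank evaluates its own shard (via the DistributedSampler) and the per-rank losses are then averaged across processes:

import torch
import torch.distributed as dist

model.eval()
total_eval_loss = torch.zeros(1, device=gpu_id)
with torch.no_grad():
    for batch in validation_loader:
        b_input_ids = batch[0].to(gpu_id)
        b_masks = batch[1].to(gpu_id)
        outputs = model(b_input_ids, attention_mask=b_masks, labels=b_input_ids)
        total_eval_loss += outputs[0].detach()

# sum the per-rank loss sums, then average over ranks and batches
dist.all_reduce(total_eval_loss, op=dist.ReduceOp.SUM)
avg_val_loss = (total_eval_loss / dist.get_world_size() / len(validation_loader)).item()
if gpu_id == 0:
    print(f"avg. validation loss: {avg_val_loss:.4f}")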


@ptrblck the ImageNet example code is hard to follow

Hi @ptrblck, I hope you are well. Sorry, I need to define an extra loss for GPT-2. The code is as follows. The issue is that the sizes of the labels, (4, 400), and the logits, (4, 400, 50258), are very different. I studied the GPT-2 loss function in the source code and used the same process, but the sizes still differ. Do you have any idea how I can improve the code? Many thanks.


    loss_fct = torch.nn.CrossEntropyLoss()  # the criterion GPT-2 uses internally

    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)
        
        optimizer.zero_grad()        

        outputs = model(  b_input_ids,
                          labels=b_labels, 
                          attention_mask = b_masks,
                          token_type_ids=None
                        )
   
        loss, logits = outputs[:2]
    
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = b_labels[..., 1:].contiguous()

        # Flatten the tokens
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

I don’t know which loss function you are using, but some expect model outputs and targets in a different shape (e.g. nn.CrossEntropyLoss and nn.NLLLoss if class labels are used for a multi-class classification/segmentation use case).
You might need to check the docs of the used loss function to see which shapes are expected.
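
For example, nn.CrossEntropyLoss expects logits of shape (N, C) and class-index targets of shape (N,), which is why the GPT-2 source flattens the shifted tensors before calling it (a minimal sketch with the sizes from the post above):

import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()

logits = torch.randn(4, 400, 50258)          # (batch, seq_len, vocab)
labels = torch.randint(0, 50258, (4, 400))   # (batch, seq_len)

# shift so that tokens < n predict n, then flatten batch and sequence dims
shift_logits = logits[..., :-1, :].contiguous()   # (4, 399, 50258)
shift_labels = labels[..., 1:].contiguous()       # (4, 399)

loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),  # (1596, 50258)
                shift_labels.view(-1))                         # (1596,)
print(loss)  # scalar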

@ptrblck many thanks for your answer. I want to define a new loss, for example a cosine distance, which needs the vectors to be the same size. Any idea about that? Indeed, I want to update the model with a new loss, which is

loss + extra defined loss

If you want to use e.g. a cosine distance you would need to make sure your model is returning the correct outputs which can be processed by this criterion together with the targets.
This is thus more a question of your model definition and how you are interpreting its outputs than a technical problem.
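
As a sketch only (hidden_a, hidden_b, and alpha below are hypothetical placeholders; which same-sized vectors you compare depends entirely on your model definition):

import torch
import torch.nn.functional as F

# hypothetical: two same-sized vectors produced by your model,
# e.g. pooled hidden states of two sequences
hidden_a = torch.randn(4, 768, requires_grad=True)
hidden_b = torch.randn(4, 768)

lm_loss = outputs[0]  # the language-modeling loss returned by the model

# cosine distance = 1 - cosine similarity, averaged over the batch
cos_loss = (1.0 - F.cosine_similarity(hidden_a, hidden_b, dim=-1)).mean()

alpha = 0.1  # hypothetical weighting factor
total_loss = lm_loss + alpha * cos_loss
total_loss.backward()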


Hi @ptrblck, I hope you are well. I tried the code we discussed before for fine-tuning GPT-2 with multiple GPUs, but when I load the model to generate sentences, the results are very strange: it just gives me back the input as-is, and the expected generated sentence is all padding tokens. I would appreciate your idea on where the code could be wrong: saving and loading the model, or training it? I really appreciate your help.

# imports as in the previous script, plus:
import copy
from transformers import GPTNeoForCausalLM
######################
weight_decay=0
learning_rate=7e-5
adam_epsilon=1e-8
warmup_steps = 1e2
lr=5e-5
Max_length=400

PathData='/home//NLP_Projects/'
pretrained_model = '/home//GPT_NEO_1.3B/'

#########################

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

################################################
class GPT2Dataset(Dataset):

    def __init__(self, txt_list, tokenizer, gpt2_type=pretrained_model, max_length=400):

        self.tokenizer = tokenizer
        self.input_ids = []
        self.attn_masks = []

        for txt in txt_list:

            encodings_dict = tokenizer('<|startoftext|>'+ txt + '<|endoftext|>', truncation=True, max_length=max_length, padding="max_length")

            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx] 

######################################################
def ddp_setup(rank, world_size):
    """
    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2"

    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
#########################################################

def main(rank: int, world_size: int, save_every: int, total_epochs: int, batch_size: int):
    
    gpu_id=rank

    # output directory (assumed to be set up as in the previous script)
    Path='/home//NLP_Projects/CaseSummary_resolutionProject/Results_GPT_2/multipleGPU/model_v4'\
    'data_'+str(200)+'_k'+'_'+'bs='+str(batch_size)+'_lr='+str(learning_rate)+'_epochs='+str(total_epochs)
    Results_Path=Path+'/Results/'
    os.makedirs(Results_Path, exist_ok=True)
    
    ### defined variable ###############
    seed_val = 42
    random.seed(seed_val)
    np.random.seed(seed_val)
    torch.manual_seed(seed_val)
    torch.cuda.manual_seed_all(seed_val)

    ###############################
    
    ddp_setup(rank, world_size)
    
    ###############################
    
    tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model, bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>') #gpt2-small
  
    model_or = GPTNeoForCausalLM.from_pretrained(pretrained_model)

    model_or.resize_token_embeddings(len(tokenizer))

    ## load the train and validation datasets
    print(PathData)
    trains_titles=pd.read_csv(PathData+'/'+'traindata.csv')
    valid_titles=pd.read_csv(PathData+'/'+'validdata.csv')
    
    trains_titles=trains_titles.drop(columns=['Unnamed: 0'])['0']
    valid_titles=valid_titles.drop(columns=['Unnamed: 0'])['0']

    print(trains_titles.head(2))
    
    train_dataset = GPT2Dataset(trains_titles, tokenizer, max_length=Max_length)

    Val_dataset = GPT2Dataset(valid_titles, tokenizer, max_length=Max_length)
    
    ############################################################################
    
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                                       batch_size=batch_size,
                                                        pin_memory=True,
                                                        shuffle=False,
                                                       sampler=DistributedSampler(train_dataset))

    validation_loader= torch.utils.data.DataLoader(dataset=Val_dataset,
                                                       batch_size=batch_size,
                                                        pin_memory=True,
                                                        shuffle=False,
                                                       sampler=DistributedSampler(Val_dataset))
    
    
    total_steps = len(train_loader) * total_epochs


    ################# define optimizer and scheduler#########################

    # Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
    optimizer = AdamW(model_or.parameters(), lr = learning_rate,eps = adam_epsilon)


    # Create the learning rate scheduler.
    # This changes the learning rate as the training loop progresses
    scheduler = get_linear_schedule_with_warmup(optimizer, 
                                                num_warmup_steps = warmup_steps, 
                                                num_training_steps = total_steps)

    ############################## train_loader and validation_loader #########
 
    training_steps_per_epoch=len(train_loader)
    total_num_training_steps = int(training_steps_per_epoch*total_epochs)

    model = copy.deepcopy(model_or)

    model=model.to(gpu_id)
    model = DDP(model, device_ids=[gpu_id])
    print("gpu_id",gpu_id)
    # ========================================
    #               Training
    # ========================================

    training_stats = []
           

    for epoch_i in range(0, total_epochs):
        
        print("")
        print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, total_epochs))
        print('Training...')

        ##########################################
        train_loader.sampler.set_epoch(epoch_i)
        b_sz = len(next(iter(train_loader))[0])
        print(f"[GPU{gpu_id}] Epoch {epoch_i} | Batchsize: {b_sz} | Steps: {len(train_loader)}")
        ##########################################

        t0 = time.time()

        total_train_loss = 0

        model.train()

        for step, batch in enumerate(train_loader):
        #    print("len(train_loader)",len(train_loader))
            #################################
            b_input_ids = batch[0].to(gpu_id,non_blocking=True)
            b_labels = batch[0].to(gpu_id,non_blocking=True)
            b_masks = batch[1].to(gpu_id,non_blocking=True)
            #################################

            optimizer.zero_grad()        

            outputs = model(  b_input_ids,
                             labels=b_labels, 
                              attention_mask = b_masks,
                              token_type_ids=None
                            )

            loss = outputs[0]  
            batch_loss = loss.item()
            total_train_loss += batch_loss
            loss.backward()
            optimizer.step()
            scheduler.step()

        # Calculate the average loss over all of the batches.
        avg_train_loss = total_train_loss / len(train_loader)  
        print("avg_train_loss",avg_train_loss)
        Path_3=Results_Path+'/'+'trainingloss='+str(gpu_id)+str(epoch_i)+".csv"
        torch.save(avg_train_loss,Path_3)
        
        del total_train_loss
        del batch_loss
        
        # Measure how long this epoch took.
        training_time = format_time(time.time() - t0)
        print("  Training epoch took: {:}".format(training_time))

        gc.collect()
        
        ################### saving the model ########################

        if gpu_id == 0:
    
            Path2=Results_Path+'/'+'savemodel_epoch=='+str(epoch_i)
    
            os.makedirs(Path2, exist_ok=True)

            ckp = model.module.state_dict()
            torch.save(ckp, Path2+"/checkpoint.pt")

    
    destroy_process_group()
    #############################
if __name__ == '__main__':
    import sys
    total_epochs=int(sys.argv[1])
    save_every=int(sys.argv[2])
    batch_size=int(sys.argv[3])
    world_size = (torch.cuda.device_count())-1
    print(world_size)
    mp.spawn(main, args=(world_size, save_every, total_epochs, batch_size), nprocs=world_size,join=True)

I load the model in this way


import torch
from transformers import GPTNeoForCausalLM, GPT2Tokenizer

pretrained_model = '/home/momenisa//GPT_2/'

model = GPTNeoForCausalLM.from_pretrained(pretrained_model)

tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model) #gpt2-small

model.resize_token_embeddings(len(tokenizer))

CHECKPOINT_PATH='/home//checkpoint.pt'
model.load_state_dict(torch.load(CHECKPOINT_PATH,map_location='cpu'),strict=False)
model.eval()

and the results just echo the input, followed by padding tokens.
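
One sanity check worth running here: with strict=False, mismatched keys are skipped silently, so inspecting the value returned by load_state_dict shows whether the checkpoint actually matched the model definition (e.g. after resizing the embeddings). A minimal sketch:

import torch

state = torch.load(CHECKPOINT_PATH, map_location='cpu')
result = model.load_state_dict(state, strict=False)
# if either list is non-empty, parts of the checkpoint were not loaded
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)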

@ptrblck, I am waiting for your idea, I would really appreciate it. Do you think the code has an issue? Indeed, the model is not trained well. However I change the decoding, it just sends back whatever I feed it as input, with no proper text generation. :( I compared the results: when I used one GPU there is no pad token and it generates the text for me. Big difference between 1 GPU and multiple GPUs.

Hi @ptrblck, I hope you are well. Sorry, my code was running before, but now when I want to load the model it gives me this error and I am stuck. I would appreciate it if you could let me know the solution. Many, many thanks for your help.

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
model_id="/home//sentence-transformers/Llama-2-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(model_id)

model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map={'': torch.cuda.current_device()}, torch_dtype=torch.float16)

/home/momenisa/.local/lib/python3.8/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
  from pandas.core.computation.check import NUMEXPR_INSTALLED

bin /home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/instantclient_21_1: did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/libexec/../../hadoop-hdfs/lib/*'), PosixPath('/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/libexec/../../hadoop-mapreduce/*'), PosixPath('/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/libexec/../../hadoop/*'), PosixPath('/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/libexec/../../hadoop/lib/*'), PosixPath('/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/libexec/../../hadoop-yarn/lib/*'), PosixPath('/opt/cloudera/parcels/CDH/lib/hive/lib/*'), PosixPath('/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/libexec/../../hadoop-yarn/*'), PosixPath('/usr/lib/sqoop/lib/*'), PosixPath('/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/libexec/../../hadoop-hdfs/*'), PosixPath('/opt/cloudera/parcels/CDH-5.13.3-1.cdh5.13.3.p0.2/lib/hadoop/libexec/../../hadoop-mapreduce/lib/*')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/lib/sqoop/lib/*'), PosixPath('/opt/cloudera/parcels/CDH/lib/hive/lib/*')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/user/momenisa/oauth_callback')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('servers!server=momenisa/", "access'), PosixPath('["access'), PosixPath('servers!user=momenisa"]')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//127.0.0.1'), PosixPath('http'), PosixPath('8081/hub/api')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//127.0.0.1'), PosixPath('http'), PosixPath('8081/hub/api/users/momenisa/activity')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/user/momenisa')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('53435/user/momenisa'), PosixPath('//127.0.0.1'), PosixPath('http')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/lab')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
  warn(msg)
/home/momenisa/.local/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
  warn(msg)
/opt/conda/lib/python3.8/site-packages/setuptools/distutils_patch.py:25: UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
  warnings.warn(

All errors are raised from bitsandbytes and are unrelated to PyTorch. It seems a locally installed CUDA toolkit is needed, which cannot be found.
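
A quick way to check what PyTorch itself was built with and currently sees (this does not replace the CUDA toolkit bitsandbytes is searching for, but helps narrow the setup down):

import torch

print(torch.version.cuda)         # CUDA runtime version PyTorch was built against
print(torch.cuda.is_available())  # whether a usable GPU is visible
print(torch.cuda.device_count())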


CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

This happened after Pinokio asked me to download a new CUDA version and videosuite…
I think it's downloaded in two places, possibly? I don't know what to do, and I saw you were on a similar subject here.