nn.Embedding layer returning NaN and -inf values

I am trying to finetune the Llama 2 7B model. The model I am using is: https://huggingface.co/daryl149/llama-2-7b-chat-hf.

My GPU has ~49GB of VRAM, so to finetune efficiently I am trying either quantization alone or quantization + LoRA with the help of the peft library.

Here is my model loading configuration

import torch
import transformers
from torch import bfloat16
from transformers import AutoModelForCausalLM, AutoTokenizer

def model_loader():
    if torch.cuda.is_available():
        device_map = {"": 0}

    else:
        device_map = None
        
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=False,
        bnb_4bit_compute_dtype=bfloat16
    )
    
    # Check GPU compatibility with bfloat16
    compute_dtype = getattr(torch, "float16")
    if compute_dtype == torch.float16:
        try:
            major, _ = torch.cuda.get_device_capability()
            if major >= 8:
                print("=" * 80)
                print("Your GPU supports bfloat16: accelerate training with bf16=True")
                print("=" * 80)
        except Exception as e:
            print(e)

    ## Llama model
    model = AutoModelForCausalLM.from_pretrained(
        "daryl149/llama-2-7b-chat-hf",
        device_map=device_map,
        quantization_config=bnb_config,
    )
    model.config.use_cache = False
    
    ## Setup LoRA
    from peft import PeftModelForSeq2SeqLM, LoraConfig
    config = LoraConfig(
        r= 64,
        target_modules= ['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
        lora_alpha= 16,
        lora_dropout= 0.1,
        bias= "none",
        task_type= "CAUSAL_LM",
    )
    peft_model = PeftModelForSeq2SeqLM(model, config)
    print(peft_model.print_trainable_parameters())

    tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-chat-hf", padding_side="left", truncation_side="left")
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
    
    return model, tokenizer

Case (i): I comment out the snippet below while loading the model above, i.e. I finetune with quantization only and do not use LoRA.

    from peft import PeftModelForSeq2SeqLM, LoraConfig
    config = LoraConfig(
        r= 64,
        target_modules= ['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
        lora_alpha= 16,
        lora_dropout= 0.1,
        bias= "none",
        task_type= "CAUSAL_LM",
    )
    peft_model = PeftModelForSeq2SeqLM(model, config)
    print(peft_model.print_trainable_parameters())

I kept the batch size at 1 for debugging convenience. In epoch 1, the first iteration ran as expected, but every time, exactly in the second iteration, the output logits become NaN, which in turn makes the loss NaN.
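
For reference, a minimal version of the loop and of the check that flags it looks roughly like this (the dataloader and optimizer names are placeholders, not my exact code):

for step, batch in enumerate(train_dataloader):   # placeholder dataloader, batch_size=1
    outputs = model(**batch)                      # batch holds input_ids, attention_mask, labels
    if torch.isnan(outputs.logits).any():
        print("NaN detected in logits at step", step)
    print("step", step, "loss", outputs.loss.item())
    outputs.loss.backward()
    optimizer.step()                              # placeholder optimizer
    optimizer.zero_grad()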

I used print() statements in the transformers source code locally to find out where the NaN was originating.
Attaching a relevant snippet from the source code of the Hugging Face Transformers Llama model.
File path: Github Repo.
The snippet below is taken from the LlamaModel class (you can ignore all lines except the last line):

def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[torch.LongTensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
            )

        if self.gradient_checkpointing and self.training and use_cache:
            logger.warning_once(
                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
            )
            use_cache = False
        
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        ...
        # code snippet continues

The last line, inputs_embeds = self.embed_tokens(input_ids), which takes in token ids and is supposed to return their embeddings, is returning both -inf and NaN values. Its shape is torch.Size([1, 1024, 4096]), where 1 → batch size, 1024 → sequence length, 4096 → hidden (feature) dimension.
Note: not all of the 1024 rows are NaN/-inf. In fact, there are also rows that are only partially filled with NaN/-inf. But the NaN and -inf values always occur at the same time.

Here, self.embed_tokens is defined as

self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
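
For what it is worth, the same NaN/-inf check can also be done without editing the transformers source, by registering a forward hook on the embedding layer (the hook function below is mine, not part of the library):

def embed_nan_hook(module, inputs, output):
    # fires whenever the embedding layer produces NaN or +/-inf values
    if torch.isnan(output).any() or torch.isinf(output).any():
        print("NaN/-inf in embedding output, shape:", output.shape)

model.get_input_embeddings().register_forward_hook(embed_nan_hook)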

I am really clueless as to why the embedding layer plays any role in producing NaN/-inf. Typically such issues arise from overflowing values, division by zero, or exponential functions.
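
Since nn.Embedding is essentially a lookup table, my understanding is that NaN/-inf in its output can only come from the weight table itself, so a check like the following right after the failing step should confirm that (batch here stands in for the failing input batch):

emb = model.get_input_embeddings()
print("weight has NaN:", torch.isnan(emb.weight).any().item())
print("weight has -inf:", torch.isinf(emb.weight).any().item())
print("max token id:", batch["input_ids"].max().item(), "| vocab size:", emb.num_embeddings)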

My questions:

  1. Does quantization affect the embedding layer? If yes, why?
  2. Does padding_side='left' or padding_side='right' for the tokenizer have any role in this issue? As suggested here, I set padding_side='left'. A rough sketch of how I tokenize follows below.
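
For context, the batches are tokenized roughly like this (texts is a placeholder, and padding="max_length" with max_length=1024 is my assumption for how the 1024-long sequences above are produced):

texts = ["..."]  # placeholder for a batch of training samples
batch = tokenizer(
    texts,
    padding="max_length",   # pads on the left, since padding_side="left"
    truncation=True,        # truncates on the left, since truncation_side="left"
    max_length=1024,        # matches the 1024 seen in the embedding output shape
    return_tensors="pt",
)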

Edit: I believe I have sufficiently described the issue. @ptrblck, any thoughts on why this is happening?