I am trying to finetune the Llama 2 7B model. The model I am using is https://huggingface.co/daryl149/llama-2-7b-chat-hf.
My GPU has ~49 GB of VRAM, so to finetune efficiently I am trying either quantization alone, or quantization + LoRA with the help of the peft library.
Here is my model loading code:
import torch
from torch import bfloat16

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer


def model_loader():
    if torch.cuda.is_available():
        device_map = {"": 0}
    else:
        device_map = None

    # 4-bit (NF4) quantization config for bitsandbytes
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=False,
        bnb_4bit_compute_dtype=bfloat16,
    )

    # Check GPU compatibility with bfloat16
    compute_dtype = getattr(torch, "float16")
    if compute_dtype == torch.float16:
        try:
            major, _ = torch.cuda.get_device_capability()
            if major >= 8:
                print("=" * 80)
                print("Your GPU supports bfloat16: accelerate training with bf16=True")
                print("=" * 80)
        except Exception as e:
            print(e)

    ## Llama model
    model = AutoModelForCausalLM.from_pretrained(
        "daryl149/llama-2-7b-chat-hf",
        device_map=device_map,
        quantization_config=bnb_config,
    )
    model.config.use_cache = False

    # ## Setup LoRA
    from peft import PeftModelForSeq2SeqLM, LoraConfig
    config = LoraConfig(
        r=64,
        target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
    )
    peft_model = PeftModelForSeq2SeqLM(model, config)
    print(peft_model.print_trainable_parameters())

    tokenizer = AutoTokenizer.from_pretrained(
        "daryl149/llama-2-7b-chat-hf", padding_side="left", truncation_side="left"
    )
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

    return model, tokenizer
Case (i): I comment out the snippet below while loading the model above, i.e. I finetune with quantization only and do not use LoRA.
from peft import PeftModelForSeq2SeqLM, LoraConfig
config = LoraConfig(
    r=64,
    target_modules=['q_proj', 'k_proj', 'down_proj', 'v_proj', 'gate_proj', 'o_proj', 'up_proj'],
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
peft_model = PeftModelForSeq2SeqLM(model, config)
print(peft_model.print_trainable_parameters())
I kept the batch size = 1 for debugging convenience. At epoch 1, the first iteration ran as expected. Every time, exactly at the second iteration, the output logits became nan, because of which the loss turned to nan.
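For context, the training loop itself is the usual forward/backward loop. A simplified sketch with placeholder data (my real script uses my own dataset, collator and optimizer settings, so the texts and the AdamW hyperparameters below are only illustrative):

model, tokenizer = model_loader()

# Only the parameters that still require grad are optimized (the LoRA weights in case (ii),
# the non-quantized layers such as embeddings/norms in case (i)).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-4)

texts = ["placeholder training example 1", "placeholder training example 2"]  # dummy data

model.train()
for step, text in enumerate(texts):  # batch size = 1 for debugging
    batch = tokenizer(
        text,
        max_length=1024,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    ).to(model.device)

    # Pad tokens are not masked out of the labels here; the real script prepares labels properly.
    outputs = model(**batch, labels=batch["input_ids"])
    loss = outputs.loss
    print(f"step {step}: loss = {loss.item()}")  # turns to nan from the second iteration

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()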
I used print() statements in the source code locally to find out where the nan was originating.
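The check I added essentially boils down to the following (a sketch using forward hooks; in practice I just added print() calls inside the transformers source):

def register_nan_hooks(model):
    # Attach a forward hook to every submodule that reports nan/-inf in its output.
    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and (torch.isnan(out).any() or torch.isinf(out).any()):
                print(f"nan/inf detected in the output of: {name}")
        return hook

    handles = []
    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call handle.remove() on each one when done


hook_handles = register_nan_hooks(model)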
Attaching a relevant snippet from the source code of the Hugging Face transformers Llama model.
File path: Github Repo.
The snippet below is taken from the Llama model class (you can ignore all lines except the last one):
def forward(
    self,
    input_ids: torch.LongTensor = None,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_values: Optional[List[torch.FloatTensor]] = None,
    inputs_embeds: Optional[torch.FloatTensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if (input_ids is None) ^ (inputs_embeds is not None):
        raise ValueError(
            "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
        )

    if self.gradient_checkpointing and self.training and use_cache:
        logger.warning_once(
            "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
        )
        use_cache = False

    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids)

    ...
    # code snippet continues
The last line, inputs_embeds = self.embed_tokens(input_ids), which takes token ids and is supposed to return their embeddings, is returning both -inf and nan values. Its shape is torch.Size([1, 1024, 4096]), where 1 → batch size, 1024 → token length, 4096 → feature representation dimension.
Note: not all of the 1024 rows are nan/-inf. In fact, some rows are only partially filled with nan/-inf, but nan and -inf always occur at the same time.
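This is roughly how I inspected which positions are affected (a sketch of the kind of check I ran inside that forward(), using the inputs_embeds and input_ids tensors from the snippet above):

# inputs_embeds: [1, 1024, 4096], input_ids: [1, 1024]
bad = torch.isnan(inputs_embeds) | torch.isinf(inputs_embeds)
bad_positions = bad.any(dim=-1)  # [1, 1024]: True where a token's embedding contains nan/-inf
print("affected token positions:", bad_positions.sum().item(), "of", bad_positions.numel())
print("token ids at affected positions:", input_ids[bad_positions].unique().tolist())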
Here, self.embed_tokens is defined as:

    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
I am really clueless as to why the embedding layer plays any part in producing nan/-inf. Typically such issues arise from overflowing values, problematic denominators, or exponential functions, none of which I would expect in a plain embedding lookup.
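Since (as far as I understand) nn.Embedding.forward is just a row lookup into its weight matrix, a nan/-inf in the looked-up output should mean the corresponding weight rows are already bad, e.g. corrupted by the first optimizer step. A sketch of the check that would confirm this, run after the first iteration:

embed = model.get_input_embeddings()  # the nn.Embedding(vocab_size, hidden_size) above
weight = embed.weight.detach()

print("rows with nan in embedding weight:", torch.isnan(weight).any(dim=-1).sum().item())
print("rows with -inf in embedding weight:", torch.isinf(weight).any(dim=-1).sum().item())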
My questions:
- Does quantization affect the embedding layer? If yes, why?
- Does padding_side='left' or padding_side='right' for the tokenizer play any role in this issue (see the short illustration after this list)? As suggested here, I set padding_side='left'.
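To make the second question concrete, this is the kind of difference I mean between the two settings (a sketch with a short dummy prompt; the pad token is set to eos as in model_loader above):

tok = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-chat-hf")
tok.pad_token = tok.eos_token

for side in ("left", "right"):
    tok.padding_side = side
    enc = tok("Hello world", max_length=8, padding="max_length", return_tensors="pt")
    print(side, enc["input_ids"].tolist(), enc["attention_mask"].tolist())
# With left padding the pad (eos) tokens come before the prompt tokens;
# with right padding they come after. The attention_mask marks them as 0 either way.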
Edit: I believe I have sufficiently described the issue. @ptrblck, any thoughts on why this is happening?