Model not giving any output after full fine-tuning (instruction-based fine-tuning) with DDP

Hey there, below is my DDP Trainer code. Everything runs fine: training completes on the participating nodes, a snapshot is saved, and the saved snapshot is loaded for inference. The inference code is also provided below. Since this is instruction-based fine-tuning, I have also included the data encoding (tokenization) class used for preprocessing at the end of this topic.

If any part of the code is hard to follow, kindly reply to this topic and I will respond ASAP.

My thinking:
I suspect something is missing in the training step, or that I am doing the DDP training incorrectly.

Kindly take a look at it and help me out on this. Thank you very much.

Cmd used (2 machines; node_rank changes accordingly; the positional args are total_epochs and save_every):

torchrun --nproc-per-node=1 --nnodes=2 --node_rank=0 --rdzv_id=123 --rdzv_endpoint=hostname:12345 --rdzv_backend=c10d Trainer_CPU 5 1

Trainer Code:

import os
import time

import pandas as pd
import torch
import torch.nn.functional as F
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

def ddp_setup():
    # Training runs on CPU machines, so use the gloo backend; torchrun's
    # rendezvous supplies MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
    init_process_group(backend="gloo")

class Trainer:
    def __init__(
        self,
        model: torch.nn.Module,
        train_data: DataLoader,
        save_every: int,
        snapshot_path: str,
        optimizer: torch.optim.Optimizer,
    ) -> None:
        self.local_rank = int(os.environ["LOCAL_RANK"])
        self.global_rank = int(os.environ["RANK"])
        self.model = model
        self.train_data = train_data
        self.optimizer = optimizer
        self.save_every = save_every
        self.epochs_run = 0
        self.snapshot_path = snapshot_path
        if os.path.exists(snapshot_path):
            print("Loading snapshot")
            self._load_snapshot(snapshot_path)
        self.model = DDP(self.model)

    def _load_snapshot(self, snapshot_path):
        snapshot = torch.load(snapshot_path, map_location=torch.device("cpu"))
        # Restore the weights as well, not just the epoch counter.
        self.model.load_state_dict(snapshot["MODEL_STATE"])
        self.epochs_run = snapshot["EPOCHS_RUN"]
        print(f"Resuming training from snapshot at Epoch {self.epochs_run}")

    def _save_snapshot(self, epoch):
        snapshot = {
            "MODEL_STATE": self.model.module.state_dict(),
            "EPOCHS_RUN": epoch,
        }
        torch.save(snapshot, self.snapshot_path)
        print(f"Epoch {epoch+1} | Training snapshot saved at {self.snapshot_path}")

    def _run_batch(self, source, targets):
        self.optimizer.zero_grad()
        output = self.model(source)
        loss = F.cross_entropy(output, targets)
        loss.backward()
        self.optimizer.step()

    def _custom_run_batch(self, source):
        self.optimizer.zero_grad()
        output = self.model(**source, labels=source["input_ids"].to(torch.long))
        loss = output.loss
        loss.backward()
        self.optimizer.step()
        print(f"CPU: {self.global_rank} | Loss: {loss}")

    def _run_epoch(self, epoch):
        b_sz = len(next(iter(self.train_data))["input_ids"])
        print(f"[CPU{self.global_rank}] Epoch {epoch+1} | Batchsize: {b_sz} | Steps: {len(self.train_data)}")
        for batch in self.train_data:
            batch = {k: v.to("cpu") for k, v in batch.items()}
            self._custom_run_batch(batch)

    def train(self, max_epochs: int):
        for epoch in range(self.epochs_run, max_epochs):
            self._run_epoch(epoch)
            if (self.local_rank == 0 and epoch % self.save_every == 0) or (epoch == max_epochs - 1):
                self._save_snapshot(epoch)

#Need check specific about dataset and dataloader
def prepare_dataloader(dataset: Dataset, batch_size: int):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        # DistributedSampler gives each rank a distinct shard of the data.
        sampler=DistributedSampler(dataset),
    )


#Loading the Dataset
def load_data(tokenizer):
    df = pd.read_csv("train_data.csv")
    X_train, y_train = df["document"].tolist(), df["summary"].tolist()
    train_dataset = Encoding(X_train, y_train, tokenizer, max_length=512)
    return train_dataset


def llm_load_train_objs(batch_size):
    #Initialize model
    model_name = 'EleutherAI/gpt-neo-1.3B'
    tokenize = GPT2Tokenizer.from_pretrained(model_name, bos_token='', eos_token='', pad_token='', max_length=512)
    model = GPTNeoForCausalLM.from_pretrained(model_name)
    for buffers in model.buffers():
        if buffers.dtype == torch.bool:
            # gloo cannot broadcast bool buffers, so cast them to uint8.
            buffers.data = buffers.data.to(torch.uint8)
    train_data = load_data(tokenize)
    dataset = prepare_dataloader(train_data, batch_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    return model, dataset, optimizer

def main(save_every: int, total_epochs: int, batch_size: int, snapshot_path: str = ""):
    ddp_setup()
    model, dataset, optimizer = llm_load_train_objs(batch_size)
    trainer = Trainer(model, dataset, save_every, snapshot_path, optimizer)
    start_time = time.time()
    trainer.train(total_epochs)
    end_time = time.time()
    print("Total Time taken: ", end_time - start_time, " seconds")
    destroy_process_group()

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='simple distributed training job')
    parser.add_argument('total_epochs', type=int, help='Total epochs to train the model')
    parser.add_argument('save_every', type=int, help='How often to save a snapshot')
    parser.add_argument('--batch_size', default=2, type=int, help='Input batch size on each device (default: 2)')
    args = parser.parse_args()
    main(args.save_every, args.total_epochs, args.batch_size)

Inference code(using jupyter-notebook here):

model_name = "EleutherAI/gpt-neo-1.3B"
tokenize = GPT2Tokenizer.from_pretrained(model_name, bos_token='', eos_token='', pad_token='', max_length=512)
model = GPTNeoForCausalLM.from_pretrained(model_name)
for buffers in model.buffers():
    if buffers.dtype == torch.bool:
        # Match the training-time cast so the snapshot's buffer dtypes line up.
        buffers.data = buffers.data.to(torch.uint8)
snapshot_path = ''
snapshot = torch.load(snapshot_path, map_location=torch.device('cpu'))
# Without this, the fine-tuned weights are never applied to the model.
model.load_state_dict(snapshot["MODEL_STATE"])

content = "toyota team europe were banned from the world rally championship."
instruct_prompt_2 = f'Summarize the given content:{content}\nSummary:'

generated = tokenize(f"{instruct_prompt_2}", return_tensors='pt').input_ids

for buffers in model.buffers():
    if buffers.dtype == torch.uint8:
        # Cast the buffers back to bool for generation.
        buffers.data = buffers.data.to(torch.bool)

sample_outputs = model.generate(generated, top_k=50, max_length=200)

predicted_text = tokenize.decode(sample_outputs[0])
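
A quick way to confirm that checkpoint weights actually end up inside a model after `torch.load` is to compare its parameters against the checkpoint entries. A minimal self-contained sketch with a toy module (all names hypothetical, not the GPT-Neo model above):

```python
import io
import torch
import torch.nn as nn

# Toy stand-in for the fine-tuned model (hypothetical, illustration only).
model = nn.Linear(4, 2)
snapshot = {"MODEL_STATE": model.state_dict(), "EPOCHS_RUN": 3}

# Round-trip through a buffer, mimicking torch.save / torch.load on disk.
buf = io.BytesIO()
torch.save(snapshot, buf)
buf.seek(0)
loaded = torch.load(buf, map_location="cpu")

# A fresh model starts from different random weights; only after
# load_state_dict should every parameter match the checkpoint exactly.
fresh = nn.Linear(4, 2)
fresh.load_state_dict(loaded["MODEL_STATE"])
weights_match = all(
    torch.equal(p, loaded["MODEL_STATE"][n]) for n, p in fresh.named_parameters()
)
print(weights_match)
```

If this check fails for the real snapshot, the generation step is running on base (or random) weights, which would explain degenerate output.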

Data Encoding class:

class Encoding(Dataset):
    def __init__(self, content, summary, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []

        for content, summary in zip(content, summary):
            prep_txt = f'<start>content:{content}\nsummary:{summary}<stop>'
            encodings_dict = tokenizer(prep_txt, truncation=True, max_length=max_length, padding="max_length")
            encodings_dict["input_ids"] = torch.tensor(encodings_dict["input_ids"]).to(torch.int32)
            encodings_dict["attention_mask"] = torch.tensor(encodings_dict["attention_mask"]).to(torch.int8)
            self.input_ids.append(encodings_dict["input_ids"])
            self.attn_masks.append(encodings_dict["attention_mask"])
            self.labels.append(summary)

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Returned as a dict so the default collate produces the
        # {"input_ids": ..., "attention_mask": ...} batches the Trainer expects.
        return {"input_ids": self.input_ids[idx], "attention_mask": self.attn_masks[idx]}
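

The `_run_epoch` loop in the Trainer indexes batches like dicts (`batch["input_ids"]`, `batch.items()`), which works when `__getitem__` returns a dict: PyTorch's default collate then stacks each key into one batched tensor. A minimal standalone sketch (toy data, hypothetical sizes):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyEncoding(Dataset):
    # Each sample is a dict, mirroring the tokenizer output above.
    def __init__(self, n=6, seq_len=4):
        self.items = [
            {"input_ids": torch.arange(seq_len), "attention_mask": torch.ones(seq_len)}
            for _ in range(n)
        ]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]

loader = DataLoader(ToyEncoding(), batch_size=2)
batch = next(iter(loader))
# Default collate turns a list of dicts into a dict of stacked tensors.
print(batch["input_ids"].shape)  # torch.Size([2, 4])
```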



Hi, I'm not sure I fully understand what you are asking here.

  1. When you load the model from checkpoint before finetune, are you able to get output from the model?
  2. After fine-tune, in which step did you not see output?

Hi @fduwjj Thanks for reading this topic.

What I'm trying:
I'm trying to fine-tune a model for summarization using GPT-Neo-1.3B as the base model. With the HuggingFace Trainer class the fine-tuning succeeds, and training with PyTorch DDP (for faster fine-tuning) also succeeds, but the snapshot saved from the DDP training gives no output: the model simply emits the pad token I defined in the tokenizer.

For eg:(Predicted by model)

Summarize the given content:
australia 's current account deficit shrunk by a record 3.300 billion dollars billion us -rrb- in the june quarter due to soaring commodity prices , figures released monday showed .

This is a sample prompt I tried on the model after fine-tuning with DDP. Kindly help me with this.

I suspect the training step might be wrong; any resource suggestions would be great too.

Are you saying that after using DDP for fine-tuning, the model is not working properly now?

Exactly, @fduwjj! The model is not giving any response; it simply spits back the prompt I give it.

OK, then this might be hard to debug… One thing you can do here is get the state_dict of your model and see how it changes along the way. For example, when you load it with torch.load, after the DDP wrap, and after fine-tuning before you save the checkpoint: how do the weights/params change? Are those changes expected? I hope this gives you some clue about what and where things go wrong.
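
To make that tracking concrete, here is a minimal sketch (toy module and names, hypothetical) that snapshots the parameters at one stage and checks whether a training step actually changed them:

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for the LLM; the same fingerprinting works on any nn.Module.
model = nn.Linear(8, 1)
before = copy.deepcopy(model.state_dict())  # clone, not a reference

# One training step, shaped like the fine-tuning loop.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(4, 8), torch.randn(4, 1)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()

# If nothing changed here, backward/step never ran on these parameters.
changed = any(
    not torch.equal(before[k], v) for k, v in model.state_dict().items()
)
print(f"weights changed after one step: {changed}")
```

Comparing such snapshots at each stage (after torch.load, after the DDP wrap, after training) narrows down exactly where the weights stop evolving.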

Hi @fduwjj, thanks for sharing your thoughts; I will definitely check. You are saying that after loading the base LLM I should compare its weights with the fine-tuned LLM, right?

I was trying to suggest keeping track of the weight changes along the way (at every stage in your code: loading, before fine-tuning, after fine-tuning, etc.), because DDP itself is not supposed to change the model; it lets training proceed as if it happened on a single machine.

So it might come down to how you fine-tune with respect to batch size. If I understand correctly, originally you used one GPU for fine-tuning with batch size b_1, and now with DDP you keep the global batch size at b_1 (so the batch size per GPU is b_1 // number of GPUs), right?
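
As a tiny illustration of that batch-size bookkeeping (the numbers are made up, not from the thread):

```python
def per_rank_batch_size(global_batch_size: int, world_size: int) -> int:
    # Each DDP rank processes global_batch_size // world_size samples per step,
    # so the effective (global) batch size matches the single-device run.
    return global_batch_size // world_size

print(per_rank_batch_size(32, 2))  # 2 nodes x 1 proc each -> 16
```

If instead each rank keeps the original batch size, the effective batch size grows by a factor of world_size, which changes the training dynamics versus the single-device run.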

Hi @fduwjj, sorry for the delayed response.

What you are saying is correct; let me check the weight changes as you mentioned. I'll give it a try!