T5 model training stops without any error

sreraku · April 2, 2023, 2:16am

Hi
After some struggle I was able to successfully set up my GPU, and I can see that PyTorch detects it in the conda environment.

Now I am running a T5 multilabel classification model. It starts training, and I can see the GPU fan speed and temperature go up, as well as the volatile GPU-util (through nvidia-smi).

The dataset is relatively small, taken from the Jigsaw dataset. When I train, the estimated time is 2:47:00 and it runs fine up to 14%. After that it just stops. The Jupyter notebook shows it as still training, but the numbers never change. nvidia-smi shows that GPU memory is still allocated, but nothing is running on the GPU.

Here is my model config:
    self.SEED = 42
    self.MODEL_PATH = 't5-base'

    # data
    self.TOKENIZER = T5Tokenizer.from_pretrained(self.MODEL_PATH)
    self.SRC_MAX_LENGTH = 320
    self.TGT_MAX_LENGTH = 20
    self.BATCH_SIZE = 8
    self.VALIDATION_SPLIT = 0.25

    # model
    self.DEVICE = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
    self.FULL_FINETUNING = True
    self.LR = 3e-5
    self.OPTIMIZER = 'AdamW'
    self.CRITERION = 'BCEWithLogitsLoss'
    self.SAVE_BEST_ONLY = True
    self.N_VALIDATE_DUR_TRAIN = 3
    self.EPOCHS = 1

code for training:
def train(
    model,
    train_dataloader,
    val_dataloader,
    criterion,
    optimizer,
    scheduler,
    epoch
):

    # we validate config.N_VALIDATE_DUR_TRAIN times during the training loop
    nv = config.N_VALIDATE_DUR_TRAIN
    temp = len(train_dataloader) // nv
    temp = temp - (temp % 100)
    validate_at_steps = [temp * x for x in range(1, nv + 1)]

    train_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader,
                                      desc='Epoch ' + str(epoch))):
        # switch back to training mode on every step, in case val() left the model in eval mode
        model.train()

        # unpack the batch contents and push them to the device (cuda or cpu).
        b_src_input_ids = batch['src_input_ids'].to(device)
        b_src_attention_mask = batch['src_attention_mask'].to(device)

        # replace padding token ids with -100 so the loss ignores them
        labels = batch['tgt_input_ids'].to(device)
        labels[labels[:, :] == config.TOKENIZER.pad_token_id] = -100

        b_tgt_attention_mask = batch['tgt_attention_mask'].to(device)

        # clear accumulated gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(input_ids=b_src_input_ids,
                        attention_mask=b_src_attention_mask,
                        labels=labels,
                        decoder_attention_mask=b_tgt_attention_mask)
        loss = outputs[0]
        train_loss += loss.item()

        # backward pass
        loss.backward()

        # update weights
        optimizer.step()

        # update scheduler
        scheduler.step()

        if step in validate_at_steps:
            print(f'-- Step: {step}')
            _ = val(model, val_dataloader, criterion)

    avg_train_loss = train_loss / len(train_dataloader)
    print('Training loss:', avg_train_loss)

There is no error when it stops. How do I figure out what's wrong?
Epoch 0: 14%|█████████▏ | 2179/15957 [23:21<2:27:46, 1.55it/s]
It stays like that forever.

I'd appreciate any help.

AbdulsalamBande · April 2, 2023, 2:36am

It seems that your training process is freezing or getting stuck. Here are a few suggestions:

Interrupt and debug: If the training process is stuck, try interrupting the kernel in Jupyter Notebook (Kernel > Interrupt). This may show you the error message or traceback if there was an issue during training.
Update libraries: Ensure that you are using the latest version of PyTorch, Transformers, and other libraries. Updating the libraries might resolve compatibility or bug issues that could be causing the training to freeze.
Reduce complexity: You can try training on a smaller subset of the dataset or using a smaller model to test if the issue persists.
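
If interrupting the kernel does not produce a traceback, one option is to have Python dump the stack of every thread once a step takes suspiciously long. The sketch below uses the standard-library faulthandler module; the helper name hang_watchdog, the log file name, and the timeout are illustrative assumptions, and slow_step merely stands in for a real training step. The dump goes to a file because faulthandler needs a real file descriptor, which a notebook's stderr may not provide.

```python
import faulthandler
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(log_path, timeout_s):
    """Dump all thread stacks to log_path if the wrapped block exceeds timeout_s.

    hang_watchdog is a hypothetical helper name, not part of any library.
    """
    with open(log_path, "w") as log_file:
        # repeat=True keeps dumping every timeout_s seconds while the block runs.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
        try:
            yield
        finally:
            # Cancel before the file is closed so no dump hits a closed descriptor.
            faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Stand-in for a training step that stalls.
    time.sleep(0.5)


if __name__ == "__main__":
    with hang_watchdog("hang_trace.log", timeout_s=0.1):
        slow_step()
    # The log shows which Python line each thread was executing when it stalled.
    print(open("hang_trace.log").read())
```

In a real run the timeout would be set well above the normal duration of one step, and the with block would wrap the body of the training loop. The dump only shows Python frames, so a stall inside a CUDA kernel will appear as the Python line that is waiting on the GPU.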

sreraku · April 2, 2023, 3:41am

Thanks

I tried interrupting. It definitely interrupts, but it doesn't show any error.

I have the latest libraries (in fact, I set up my conda environment successfully yesterday and checked that PyTorch recognizes the GPU in the conda env).

I am running the baseline: no added layers, 2 epochs, batch size = 8.

AbdulsalamBande · April 2, 2023, 7:26pm

You can try these suggestions:

Monitor GPU usage: Continuously monitor GPU usage using watch -n 1 nvidia-smi in a terminal during training. This will allow you to see any fluctuations in GPU usage, memory consumption, and temperature. Check if there’s any sudden drop in GPU utilization when the training gets stuck.
Increase the frequency of validation: Increase the frequency of validation by modifying the validate_at_steps variable. This can help you identify if there’s an issue with the validation process or if the training loop itself is causing the problem.
Add more print statements: Add more print statements in the training and validation functions to monitor the progress at different stages. For example, print the step, loss value, and GPU usage at each iteration. This can help you identify the exact point at which the training gets stuck.
Test with a smaller dataset: Try running your code with a smaller dataset (e.g., just a few samples) to see if the problem persists. This will help you rule out any issues related to the size of the dataset.
Try running the code outside Jupyter Notebook: Try running your training code as a standalone Python script to see if any issues arise. Jupyter Notebook can sometimes mask issues that would be more apparent in a regular Python script.
Test with a different model: To rule out any issues with the specific model you’re using (T5), you can try training a different model (e.g., BERT, GPT-2) using the same dataset and training loop. This can help you determine if the issue is related to the specific model or the training process in general.
Check for any issues with the dataset: Make sure that your dataset is properly formatted and doesn’t contain any corrupted or problematic samples. You can add print statements to display the input and target data during training to check if there are any issues.
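
To separate data-loading problems from model problems, the dataloader can be iterated on its own, with no forward or backward pass, while timing each batch. The following is a minimal sketch under that assumption; time_batches and its parameters are hypothetical names, and it works with any iterable, including a PyTorch DataLoader.

```python
import time


def time_batches(loader, slow_after_s=5.0, report_every=100, max_batches=None):
    """Iterate over loader without a model and collect batches that load slowly.

    Returns a list of (batch_index, seconds) for every batch that took longer
    than slow_after_s to produce.
    """
    slow = []
    iterator = iter(loader)
    index = 0
    while max_batches is None or index < max_batches:
        start = time.perf_counter()
        try:
            next(iterator)
        except StopIteration:
            break
        elapsed = time.perf_counter() - start
        if elapsed > slow_after_s:
            slow.append((index, elapsed))
        if index % report_every == 0:
            # flush=True so the last index reached is visible even if the
            # next batch never arrives.
            print(f"loaded batch {index} in {elapsed:.3f}s", flush=True)
        index += 1
    return slow
```

If this scan itself freezes near the same batch index as training does, the problem is likely in the dataset or the loader workers rather than in the model. If it completes cleanly, the data pipeline is probably fine and the forward/backward pass deserves a closer look.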

_Rayhan · January 12, 2024, 7:12pm

I am having the same issue.

I have tried all the above methods to no avail.

I use a fixed seed for reproducibility, yet training just stops at a certain point. The GPU memory stays allocated, but the utilisation drops to idle, i.e. 4-10%.

awethaileslassie · June 26, 2024, 8:56pm

Hi everyone,

I’m encountering the same issue and have tried all the suggested solutions, but nothing has worked so far.

For context, I shuffle the data. Despite this, the training process halts at a certain epoch and iteration. The GPU memory remains allocated, but the utilization drops to idle (0%).
Upon further investigation, I found that it gets stuck during the backward pass:

scaler.scale(total_loss).backward()

Has anyone found a solution to this problem? Any advice or insights would be greatly appreciated.

Thank you!

ptrblck · June 26, 2024, 9:11pm

Are you seeing this issue at exactly the same iteration (even if the data is shuffled) or does it depend on the used data sample?

awethaileslassie · June 27, 2024, 4:36pm

@ptrblck Yes, exactly at epoch 475 and batch iteration 701. I am training a YOLOP model for 1000 epochs on 200K image samples.

ptrblck · June 27, 2024, 5:21pm

Check the input tensors at this particular iteration and make sure your script is able to load and process these. If it’s not specific to the data index, try to narrow down what exactly differs in this iteration (e.g. are you starting a validation run etc.).

awethaileslassie · June 27, 2024, 7:22pm

@ptrblck I added line-by-line logging to check where it gets stuck, and it seems to be stuck during the backward step (log below).
FYI, I am using a custom data loader that fills each batch proportionally with samples from multiple datasets. In this specific instance, I am using 5 datasets of varying sizes. Consequently, the batch size changes from 128 to 113 because some datasets are nearing the end of their iterations. However, these datasets will continue loading the next items because I use a cycling method to ensure continuous training.

46%|████▌ | 699/1515 [04:28<04:52, 2.79it/s]Epoch: 475, Iteration: 699 - Starting batch processing
Input data shapes: [torch.Size([128, 3, 384, 640])]
Target data shapes: [torch.Size([512, 7]), torch.Size([128, 1, 384, 640])]
Epoch: 475, Iteration: 699 - Inference done
Model outputs: [torch.Size([128, 3, 48, 80, 8]), torch.Size([128, 3, 24, 40, 8]), torch.Size([128, 3, 12, 20, 8])]
Epoch: 475, Iteration: 699 - Loss calculated
Total loss: tensor([0.0787], device='cuda:0', grad_fn=)
Individual losses: (0.04480060935020447, 0.017336051911115646, 0.011137641966342926, 0.0, 0.005461167544126511)
Epoch: 475, Iteration: 699 - Zeroing gradients
Epoch: 475, Iteration: 699 - Backward pass done
Epoch: 475, Iteration: 699 - Max gradient norm: 33046.3046875
Epoch: 475, Iteration: 699 - Optimizer step done
Learning rate for param group 0: 5.86715775205671e-05

46%|████▌ | 700/1515 [04:28<04:41, 2.89it/s]Epoch: 475, Iteration: 700 - Starting batch processing
Input data shapes: [torch.Size([128, 3, 384, 640])]
Target data shapes: [torch.Size([436, 7]), torch.Size([128, 1, 384, 640])]
Epoch: 475, Iteration: 700 - Inference done
Model outputs: [torch.Size([128, 3, 48, 80, 8]), torch.Size([128, 3, 24, 40, 8]), torch.Size([128, 3, 12, 20, 8])]
Epoch: 475, Iteration: 700 - Loss calculated
Total loss: tensor([0.0880], device='cuda:0', grad_fn=)
Individual losses: (0.04204834625124931, 0.015310650691390038, 0.021696986630558968, 0.0, 0.008894836530089378)
Epoch: 475, Iteration: 700 - Zeroing gradients
Epoch: 475, Iteration: 700 - Backward pass done
Epoch: 475, Iteration: 700 - Max gradient norm: 103459.15625
Epoch: 475, Iteration: 700 - Optimizer step done
Learning rate for param group 0: 5.86715775205671e-05

46%|████▋ | 701/1515 [04:29<04:32, 2.98it/s]Epoch: 475, Iteration: 701 - Starting batch processing
Input data shapes: [torch.Size([113, 3, 384, 640])]
Target data shapes: [torch.Size([458, 7]), torch.Size([113, 1, 384, 640])]
Epoch: 475, Iteration: 701 - Inference done
Model outputs: [torch.Size([113, 3, 48, 80, 8]), torch.Size([113, 3, 24, 40, 8]), torch.Size([113, 3, 12, 20, 8])]
Epoch: 475, Iteration: 701 - Loss calculated
Total loss: tensor([0.0830], device='cuda:0', grad_fn=)
Individual losses: (0.046016838401556015, 0.017843084409832954, 0.011571710929274559, 0.0, 0.0075934575870633125)
Epoch: 475, Iteration: 701 - Zeroing gradients
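
The most visible difference at iteration 701 in the log above is the batch size, which is 113 instead of 128. A small helper can flag such changes automatically instead of relying on reading the log by eye. This is only a sketch: batch_size_changes is a hypothetical name, and in a real loop the sizes would presumably come from something like inputs.shape[0].

```python
def batch_size_changes(batch_sizes, start=0):
    """Return (iteration, previous_size, new_size) wherever the batch size changes.

    start is the iteration number of the first entry in batch_sizes.
    """
    changes = []
    previous = None
    for iteration, size in enumerate(batch_sizes, start=start):
        if previous is not None and size != previous:
            changes.append((iteration, previous, size))
        previous = size
    return changes


# Batch sizes for iterations 699 to 701, taken from the log above.
print(batch_size_changes([128, 128, 113], start=699))  # [(701, 128, 113)]
```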

awethaileslassie · July 4, 2024, 5:58pm

I solved the issue I was having and am posting the fix here in case it is useful for others. The problem went away after setting drop_last=True in the PyTorch DataLoader, which suggests that the issue was related to the uneven batch size at the end of the dataset. NB: drop_last=True drops the last batch when it is incomplete, i.e. when the number of examples in the dataset is not divisible by batch_size.
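
For readers unfamiliar with the option, here is a minimal sketch of what drop_last changes, using a toy TensorDataset of 10 samples and a batch size of 4 (both chosen only for illustration). It shows the effect on batch sizes only. Why an uneven batch caused the hang in this case is not established in the thread.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples, which is not divisible by the batch size of 4.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))


def batch_sizes(loader):
    """Collect the number of samples in each batch the loader yields."""
    return [batch[0].shape[0] for batch in loader]


keep_last = DataLoader(dataset, batch_size=4, drop_last=False)
drop_last = DataLoader(dataset, batch_size=4, drop_last=True)

print(batch_sizes(keep_last))  # [4, 4, 2]
print(batch_sizes(drop_last))  # [4, 4]
```

The trade-off is that with drop_last=True the leftover samples are skipped in that epoch. With shuffling enabled, a different set of samples is left over each epoch, so no sample is excluded permanently.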

ptrblck · July 4, 2024, 7:31pm

Good to hear you’ve solved the issue and thanks for sharing your solution.