T5 model training stops without any error

Hi
After some struggle I was able to successfully set up my GPU, and I can see that PyTorch recognizes it in the conda environment.

Now I am running a T5 multi-label classification model. It starts training, and I can see the GPU fan speed, temperature, and volatile GPU-util go up (through nvidia-smi).

The dataset is relatively small (from the Jigsaw dataset). When I train, the progress bar estimates about 2:47:00 hrs and runs fine up to 14%. After that it just stops. The Jupyter notebook shows it as still training, but none of the numbers ever change. nvidia-smi shows the memory is still taken up on the GPU but nothing running on it.

Here is my model config:
    self.SEED = 42
    self.MODEL_PATH = 't5-base'

    # data
    self.TOKENIZER = T5Tokenizer.from_pretrained(self.MODEL_PATH)
    self.SRC_MAX_LENGTH = 320
    self.TGT_MAX_LENGTH = 20
    self.BATCH_SIZE = 8
    self.VALIDATION_SPLIT = 0.25

    # model
    self.DEVICE = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
    self.FULL_FINETUNING = True
    self.LR = 3e-5
    self.OPTIMIZER = 'AdamW'
    self.CRITERION = 'BCEWithLogitsLoss'
    self.SAVE_BEST_ONLY = True
    self.N_VALIDATE_DUR_TRAIN = 3
    self.EPOCHS = 1

Code for training:
def train(
    model,
    train_dataloader,
    val_dataloader,
    criterion,
    optimizer,
    scheduler,
    epoch
):

    # we validate config.N_VALIDATE_DUR_TRAIN times during the training loop
    nv = config.N_VALIDATE_DUR_TRAIN
    temp = len(train_dataloader) // nv
    temp = temp - (temp % 100)
    validate_at_steps = [temp * x for x in range(1, nv + 1)]

    train_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader,
                                      desc='Epoch ' + str(epoch))):
        # set model.train() every step, since validation switches the model to eval mode
        model.train()

        # unpack the batch contents and push them to the device (cuda or cpu)
        b_src_input_ids = batch['src_input_ids'].to(device)
        b_src_attention_mask = batch['src_attention_mask'].to(device)

        # replace pad token ids in the labels with -100 so the loss ignores them
        labels = batch['tgt_input_ids'].to(device)
        labels[labels[:, :] == config.TOKENIZER.pad_token_id] = -100

        b_tgt_attention_mask = batch['tgt_attention_mask'].to(device)

        # clear accumulated gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(input_ids=b_src_input_ids,
                        attention_mask=b_src_attention_mask,
                        labels=labels,
                        decoder_attention_mask=b_tgt_attention_mask)
        loss = outputs[0]
        train_loss += loss.item()

        # backward pass
        loss.backward()

        # update weights
        optimizer.step()

        # update scheduler
        scheduler.step()

        if step in validate_at_steps:
            print(f'-- Step: {step}')
            _ = val(model, val_dataloader, criterion)

    avg_train_loss = train_loss / len(train_dataloader)
    print('Training loss:', avg_train_loss)

There is no error when it stops. How do I figure out what's wrong, please?
Epoch 0: 14%|█████████▏ | 2179/15957 [23:21<2:27:46, 1.55it/s]
It stays like that forever.

I'd appreciate any help.

It seems that your training process is freezing or getting stuck. Here are a few suggestions:

  • Interrupt and Debug: If the training process is stuck, try interrupting the kernel in Jupyter Notebook (Kernel > Interrupt). This may show you the error message or traceback if there was an issue during training. You can also dump a traceback automatically while the process hangs; see the sketch after this list.
  • Update libraries: Ensure that you are using the latest version of PyTorch, Transformers, and other libraries. Updating the libraries might resolve compatibility or bug issues that could be causing the training to freeze.
  • Reduce complexity: You can try training on a smaller subset of the dataset or using a smaller model to test if the issue persists.
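
As a rough sketch of the first point: Python's built-in faulthandler module can periodically dump every thread's stack trace, so if the run freezes silently, the last dump shows which call it is stuck in (a DataLoader worker, a CUDA call, etc.). The 30-minute interval below is an arbitrary choice; run this before starting training.

import faulthandler
import signal

# dump all thread stacks to stderr every 30 minutes (arbitrary interval);
# if the run hangs, the most recent dump shows where it is stuck
faulthandler.dump_traceback_later(timeout=1800, repeat=True)

# optional: allow an on-demand dump from another terminal via `kill -USR1 <pid>`
faulthandler.register(signal.SIGUSR1)

Note that faulthandler writes to the raw stderr file descriptor, so in Jupyter the dumps usually appear in the terminal that launched the notebook server rather than in the cell output.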

Thanks

I tried interrupting. It definitely interrupts but doesn't show any error.

I have the latest libraries (in fact I set up my conda environment successfully just yesterday and checked that PyTorch recognizes the GPU in the conda env).

I am running the baseline, with no added layers, 2 epochs, batch size = 8.

You can try these suggestions:

  1. Monitor GPU usage: Continuously monitor GPU usage using watch -n 1 nvidia-smi in a terminal during training. This will allow you to see any fluctuations in GPU usage, memory consumption, and temperature. Check if there’s any sudden drop in GPU utilization when the training gets stuck.
  2. Increase the frequency of validation: Increase the frequency of validation by modifying the validate_at_steps variable. This can help you identify if there’s an issue with the validation process or if the training loop itself is causing the problem.
  3. Add more print statements: Add more print statements in the training and validation functions to monitor the progress at different stages. For example, print the step, loss value, and GPU usage at each iteration. This can help you identify the exact point at which the training gets stuck.
  4. Test with a smaller dataset: Try running your code with a smaller dataset (e.g., just a few samples) to see if the problem persists; see the sketch after this list. This will help you rule out any issues related to the size of the dataset.
  5. Run the code outside Jupyter Notebook: Try running your training code as a standalone Python script to see whether the same freeze occurs. Jupyter Notebook can sometimes mask issues that would be more apparent in a regular Python script.
  6. Test with a different model: To rule out any issues with the specific model you’re using (T5), you can try training a different model (e.g., BERT, GPT-2) using the same dataset and training loop. This can help you determine if the issue is related to the specific model or the training process in general.
  7. Check for any issues with the dataset: Make sure that your dataset is properly formatted and doesn’t contain any corrupted or problematic samples. You can add print statements to display the input and target data during training to check if there are any issues.
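
For points 3–5, here is a minimal sketch of what such a diagnostic run (which you could also save as a standalone script) might look like. It assumes a train_dataset object behind your train_dataloader (the name is a placeholder), and subset_size is an arbitrary value; the commented print line is one way to log the loss and GPU memory per step.

import torch
from torch.utils.data import Subset, DataLoader

# hypothetical quick diagnostic run: a small slice of the data, no worker processes
subset_size = 500                          # arbitrary, small enough to finish in minutes
small_train_ds = Subset(train_dataset, range(subset_size))
small_train_dl = DataLoader(small_train_ds,
                            batch_size=config.BATCH_SIZE,
                            shuffle=True,
                            num_workers=0)  # removes DataLoader multiprocessing from the picture

# reuse the existing training function unchanged
train(model, small_train_dl, val_dataloader, criterion, optimizer, scheduler, epoch=0)

# inside the training loop, a per-step log line could look like:
#   print(f'step={step} loss={loss.item():.4f} '
#         f'gpu_mem={torch.cuda.memory_allocated() / 1e9:.2f} GB', flush=True)

If this small run also completes cleanly as a standalone script, the freeze is more likely related to the full data pipeline or the notebook environment than to the model itself.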

I am having the same issue.

I have tried all the above methods to no avail.

I use a fixed seed for reproducibility, yet it just stops at a certain point. The GPU memory stays allocated; however, the utilisation goes down to idle, i.e. 4-10%.