Hi
After some struggle I was able to successfully set up my GPU, and I can see that PyTorch detects it in the conda environment.
Now I am running a T5 multilabel classification model. It starts training, and I can see the GPU fan speed, temperature, and volatile GPU-util go up (through nvidia-smi).
The dataset is relatively small (the Jigsaw dataset). When I train, the estimated time shows about 2:47:00 hours and it runs fine up to about 14%. After that it just stops. The Jupyter notebook still shows it as training, but none of the numbers ever change. nvidia-smi shows the memory is still allocated on the GPU, but nothing is running on it.
Here is my model config:
self.SEED = 42
self.MODEL_PATH = 't5-base'
# data
self.TOKENIZER = T5Tokenizer.from_pretrained(self.MODEL_PATH)
self.SRC_MAX_LENGTH = 320
self.TGT_MAX_LENGTH = 20
self.BATCH_SIZE = 8
self.VALIDATION_SPLIT = 0.25
# model
self.DEVICE = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
self.FULL_FINETUNING = True
self.LR = 3e-5
self.OPTIMIZER = 'AdamW'
self.CRITERION = 'BCEWithLogitsLoss'
self.SAVE_BEST_ONLY = True
self.N_VALIDATE_DUR_TRAIN = 3
self.EPOCHS = 1
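Since DEVICE points at cuda:1, here is a quick sanity check I run before training to make sure that index actually exists (this snippet is just my own check, not part of the model code):

import torch

# 'cuda:1' needs at least two visible GPUs
print(torch.cuda.device_count())       # should be >= 2 for 'cuda:1'
print(torch.cuda.get_device_name(1))   # errors out if index 1 is not available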
Code for the training loop:
def train(
    model,
    train_dataloader,
    val_dataloader,
    criterion,
    optimizer,
    scheduler,
    epoch
):
    # we validate config.N_VALIDATE_DUR_TRAIN times during the training loop
    nv = config.N_VALIDATE_DUR_TRAIN
    temp = len(train_dataloader) // nv
    temp = temp - (temp % 100)
    validate_at_steps = [temp * x for x in range(1, nv + 1)]

    train_loss = 0
    for step, batch in enumerate(tqdm(train_dataloader,
                                      desc='Epoch ' + str(epoch))):
        # set model.train() every time during training
        model.train()

        # unpack the batch contents and push them to the device (cuda or cpu).
        b_src_input_ids = batch['src_input_ids'].to(device)
        b_src_attention_mask = batch['src_attention_mask'].to(device)

        labels = batch['tgt_input_ids'].to(device)
        labels[labels[:, :] == config.TOKENIZER.pad_token_id] = -100

        b_tgt_attention_mask = batch['tgt_attention_mask'].to(device)

        # clear accumulated gradients
        optimizer.zero_grad()

        # forward pass
        outputs = model(input_ids=b_src_input_ids,
                        attention_mask=b_src_attention_mask,
                        labels=labels,
                        decoder_attention_mask=b_tgt_attention_mask)
        loss = outputs[0]
        train_loss += loss.item()

        # backward pass
        loss.backward()

        # update weights
        optimizer.step()

        # update scheduler
        scheduler.step()

        if step in validate_at_steps:
            print(f'-- Step: {step}')
            _ = val(model, val_dataloader, criterion)

    avg_train_loss = train_loss / len(train_dataloader)
    print('Training loss:', avg_train_loss)
There is no error when it stops. How do I figure out what is wrong?
Epoch 0: 14%|█████████▏ | 2179/15957 [23:21<2:27:46, 1.55it/s]
It stays like that forever.
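The only idea I have so far (not sure if it is the right approach) is to add something like this near the top of the notebook, so that Python periodically dumps the stack of every thread and I can see where the loop is stuck (e.g. inside the DataLoader or a CUDA call) when it stalls:

import faulthandler

# my own debugging idea, not part of the training code:
# every 10 minutes, print the traceback of every thread to stderr,
# so that when the loop stalls I can see where it is stuck
faulthandler.dump_traceback_later(timeout=600, repeat=True)

I also read that py-spy dump --pid <PID> can print the stack of an already-running process, which might be easier since the notebook is already stuck.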
I'd appreciate any help, please.