My question answer model training is taking too long

My question-answering NLP model is taking too much time to run while the GPU is barely being utilised, and my loss is about 3-4 million.
I have 13 GB of RAM and a 15 GB graphics card from Google Colab. Please help me!

This is my training code:

import torch
from tqdm.auto import tqdm
epochs = 5
# training
for epoch in tqdm(range(epochs)):
  train_loss = 0
  for batch in train_dataloader:
    x,y= batch[0],batch[1]
    # assumed reconstruction: the right-hand sides of two lines were lost here
    x, y = x.to(device), y.to(device)

    # perform forward pass
    logits = model_0(x)

    # calculate loss
    loss = loss_fn(logits, y)
    train_loss += loss

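    # optimizer zero grad
    optimizer.zero_grad()  # `optimizer` is an assumed name; the call was missing from the post

    # loss back
    loss.backward()

    # step
    optimizer.step()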

  # testing
  with torch.inference_mode():
    test_loss = 0
    for batch in test_dataloader:
      x,y = batch[0], batch[1]
      # assumed reconstruction: the right-hand sides of two lines were lost here
      x, y = x.to(device), y.to(device)

      # forward pass
      logits = model_0(x)

      # calc loss
      loss = loss_fn(logits, y)
      test_loss += loss
    test_loss /= len(test_dataloader)  # average once, after the loop over test batches

  # print some info
  train_loss /= len(train_dataloader)
  print(f"epoch : {epoch}  || train_loss : {train_loss:.4f}  || test_loss : {test_loss:.4f}")

Here is the model:

# creating the model
from torch import nn

class s2s_model_v0(nn.Module):
  def __init__(self,input_size,hidden_size,output_size):
    super().__init__()  # required before submodules can be registered on an nn.Module
    self.ln = nn.LayerNorm(normalized_shape = input_size)
    self.encoder = nn.GRU(input_size = input_size,hidden_size = hidden_size, batch_first = True)
    self.ln2 = nn.LayerNorm(normalized_shape = hidden_size)
    self.decoder = nn.GRU(input_size = hidden_size,hidden_size = hidden_size, batch_first = True)
    self.ln3 = nn.LayerNorm(normalized_shape = hidden_size)
    self.linear2 = nn.Linear(in_features = hidden_size, out_features = hidden_size) # also i doubled no. of linear layers
    self.linear = nn.Linear(in_features = hidden_size, out_features = output_size)

  def forward(self,x):
    x = self.ln(x)
    encoder_output, enc_hidden_state = self.encoder(x)
    encoder_output = self.ln2(encoder_output)
    decoder_output, _ = self.decoder(encoder_output, enc_hidden_state)
    decoder_output = self.ln3(decoder_output)
    final_output = self.linear(self.linear2(decoder_output))
    return final_output

Please help me.

You are accumulating the entire computation graph in:

train_loss += loss

so either detach() the loss tensor or call item() on it. Otherwise every iteration keeps its graph alive, which would not only slow down your code (as more memory is required in each iteration) but could also yield an out-of-memory error.
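
To make the difference concrete, here is a minimal sketch of both ways of accumulating the loss. The tiny nn.Linear model, the random batches and the names graph_sum and float_sum are hypothetical stand-ins, not your actual setup; only the accumulation pattern matters.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for model_0, loss_fn and train_dataloader.
model = nn.Linear(4, 2)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]

graph_sum = 0.0  # becomes a tensor that references every iteration's graph
float_sum = 0.0  # stays a plain Python float

for x, y in batches:
    loss = loss_fn(model(x), y)

    graph_sum = graph_sum + loss   # problematic: keeps the autograd history attached
    float_sum += loss.item()       # fine: copies out the scalar value only

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(type(graph_sum).__name__, graph_sum.requires_grad)
print(type(float_sum).__name__, float_sum)
```

Both variables end up with the same value, but graph_sum is still attached to autograd while float_sum is an ordinary number. loss.detach() would work as well if you want to keep a tensor.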

Besides that, you might need to profile your code (e.g. via Nsight Systems) to check where the bottleneck is.
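
If setting up Nsight Systems is too heavy for a first look, a rough manual timing can already show whether the forward or the backward pass dominates. The sketch below is only an assumption of how that could look: the small nn.GRU, the input shape and the helper timed are made up for illustration. CUDA kernels run asynchronously, so the timer synchronizes before reading the clock when a GPU is used.

```python
import time

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical small model and input; substitute your own model and batch.
model = nn.GRU(input_size=16, hidden_size=32, batch_first=True).to(device)
x = torch.randn(8, 20, 16, device=device)


def timed(fn):
    """Run fn() and return (result, elapsed seconds), syncing the GPU if present."""
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return result, time.perf_counter() - start


(out, hidden), forward_s = timed(lambda: model(x))
_, backward_s = timed(lambda: out.sum().backward())

print(f"device: {device}")
print(f"forward: {forward_s:.6f}s  backward: {backward_s:.6f}s")
```

Timing the data loading loop in the same way would show whether the model or the input pipeline is the slow part.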

Thank you for your reply, but my code is still as slow as before and the loss is still about 3-4 million. I cannot even use the complete dataset: I'm using only about 320 samples out of 80,000, yet it still takes about an hour to run. I have also seen that the forward pass takes the most time.

I'm also confused about why it takes so much time without using much RAM or GPU memory: it uses only 5 GB of RAM out of 12 GB and only 1 GB of GPU memory out of 15 GB.

I mean, it should use more compute resources and finish the task faster, but it is not doing so.

Please help me.