How can I make my model converge? The loss is not going below 2

I am building a medical report generation model, a variant of the encoder-decoder architecture equipped with both visual and semantic attention. It is broadly similar to [1711.08195] On the Automatic Generation of Medical Imaging Reports. I have also computed the tag loss as specified in the paper:

    batch_tag_loss = criterion_tag(tags, Variable(labels, requires_grad=False)).sum()
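For context, the criteria are set up roughly like this (simplified; the exact loss classes and reduction flags in my actual script may differ):

    import torch.nn as nn

    # Multi-label loss for the tag predictions; kept per-element so it can be
    # summed manually after the call, matching the .sum() above.
    criterion_tag = nn.BCELoss(reduction='none')
    # Stop/continue decision of the sentence LSTM (two classes).
    criterion_stop = nn.CrossEntropyLoss()
    # Word-level cross-entropy over the vocabulary.
    criterion_words = nn.CrossEntropyLoss()

The sentence and word losses are then computed as follows: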

    # The sentence LSTM emits one topic vector per sentence plus stop probabilities.
    topics, ps = sentLSTM(vis_enc_output, mlc_output, captions, args.device)

    # Stop/continue loss; `prob` holds the ground-truth stop labels.
    loss_sent = criterion_stop(ps.view(-1, 2), prob.view(-1))

    # Word-level loss, accumulated sentence by sentence.
    loss_word = torch.tensor([0.0]).to(args.device)

    for j in range(captions.shape[1]):
        # The word LSTM decodes sentence j from its topic vector.
        word_outputs = wordLSTM(topics[:, j, :], captions[:, j, :])

        loss_word += criterion_words(word_outputs.contiguous().view(-1, vocab_size),
                                     captions[:, j, :].contiguous().view(-1))

    # Weighted sum of the three losses, as in the paper.
    loss = args.lambda_tag * batch_tag_loss + args.lambda_sent * loss_sent + args.lambda_word * loss_word
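One thing I am not sure about: `loss_word` is a sum over sentences, so its scale grows with the number of sentences per report. A normalized variant I have been considering (assuming `criterion_words` already averages over the tokens within a sentence):

    # Hypothetical change, not from the paper: average over sentences so the
    # word term's scale does not depend on how many sentences each report has.
    loss_word = loss_word / captions.shape[1]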

I am trying to minimize this loss, but even after 150 epochs it does not go below 2. What could be the reason? I might be doing something wrong but cannot figure out what. Can anyone please help?
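For scale, if the plateau is dominated by the word term, a cross-entropy of 2 corresponds to a perplexity of e^2 ≈ 7.4, whereas a uniform predictor would sit at ln(vocab_size). A quick way to put the number in context:

    import math

    # Cross-entropy of a predictor that guesses uniformly over the vocabulary;
    # a trained word loss should end up well below this baseline.
    print(f"uniform baseline: {math.log(vocab_size):.2f} nats, "
          f"current 2.00 -> perplexity {math.exp(2.0):.1f}")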