Recurrent network overfits the training data, but validation accuracy does not improve

Hi,

I need help with a recurrent model based on the paper "Compositional Attention Networks for Machine Reasoning".

I am trying to use this network on a video question answering dataset. The network overfits a small subset of the training data, but the validation accuracy does not improve at all. Here is the training log:

Epoch: 1; Loss: 1.80709; Acc: 0.20131: 100%|██████████| 381/381 [09:34<00:00,  1.46s/it]
100%|██████████| 47/47 [00:56<00:00,  1.66it/s]
Avg Acc: 0.16888
Epoch: 2; Loss: 1.73435; Acc: 0.22511: 100%|██████████| 381/381 [09:40<00:00,  1.51s/it]
100%|██████████| 47/47 [00:56<00:00,  1.66it/s]
Avg Acc: 0.18750
Epoch: 3; Loss: 1.60537; Acc: 0.27834: 100%|██████████| 381/381 [09:41<00:00,  1.46s/it]
100%|██████████| 47/47 [00:53<00:00,  1.66it/s]
Avg Acc: 0.17287
Epoch: 4; Loss: 1.41215; Acc: 0.34002: 100%|██████████| 381/381 [09:38<00:00,  1.45s/it]
100%|██████████| 47/47 [00:54<00:00,  1.72it/s]
Avg Acc: 0.18351
Epoch: 5; Loss: 1.53825; Acc: 0.42712: 100%|██████████| 381/381 [09:39<00:00,  1.70s/it]
100%|██████████| 47/47 [00:56<00:00,  1.62it/s]
Avg Acc: 0.18218
Epoch: 6; Loss: 1.23363; Acc: 0.53125: 100%|██████████| 381/381 [09:44<00:00,  1.46s/it]
100%|██████████| 47/47 [00:55<00:00,  1.67it/s]
Avg Acc: 0.18218
Epoch: 7; Loss: 0.84187; Acc: 0.61812: 100%|██████████| 381/381 [09:40<00:00,  1.47s/it]
Avg Acc: 0.18218

Training loss :arrow_down: and training accuracy :arrow_up: are both improving, but validation accuracy stays flat at around 0.18. :sweat:
The network uses dropout to reduce overfitting, and I apply gradient clipping to the whole network, including the LSTM and the recurrent part. Any leads on this issue?
lr=1e-4, decay=0.999, Adam optimizer with amsgrad=True.
In my case, I am using an LSTM to encode both the questions and the answers, and I3D as the video feature extractor.
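To make the setup above concrete, here is a minimal PyTorch sketch of that kind of training step. It is not the actual training code: the tiny `model` is a stand-in for the real network, and the dropout rate, `max_norm`, the 5 answer classes, and the input sizes are hypothetical values chosen only for illustration. It also assumes that `decay=0.999` refers to an exponential moving average (EMA) of the weights; if it is a learning-rate decay instead, the `update_ema` part does not apply.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Stand-in model (hypothetical): the real network with its recurrent cell,
# LSTM encoders and I3D features is not shown in this post.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(p=0.15),  # hypothetical dropout rate
    nn.Linear(32, 5),    # hypothetical: 5 answer classes
)

# Assumption: decay=0.999 is an EMA over the weights, kept in a frozen copy.
ema_model = copy.deepcopy(model)
for p in ema_model.parameters():
    p.requires_grad_(False)

# Values taken from the post: lr=1e-4, Adam with amsgrad=True.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, amsgrad=True)
criterion = nn.CrossEntropyLoss()


def update_ema(ema, online, decay=0.999):
    """ema <- decay * ema + (1 - decay) * online, parameter by parameter."""
    with torch.no_grad():
        for p_ema, p in zip(ema.parameters(), online.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)


def train_step(x, y, max_norm=10.0):
    """One optimisation step with gradient-norm clipping (max_norm is hypothetical)."""
    model.train()  # enables dropout
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Rescales gradients in place if their total norm exceeds max_norm;
    # returns the norm measured before clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    update_ema(ema_model, model)
    return loss.item(), float(total_norm)


def evaluate(net, x, y):
    """Accuracy of `net` on one batch, with dropout disabled."""
    net.eval()
    with torch.no_grad():
        return (net(x).argmax(dim=1) == y).float().mean().item()


# Random stand-in batch: 8 examples, 16 features, labels in [0, 5).
x = torch.randn(8, 16)
y = torch.randint(0, 5, (8,))
loss, grad_norm = train_step(x, y)
```

`evaluate` accepts either `model` or `ema_model`, so under the EMA assumption both sets of weights can be scored on the same validation batches.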
Any pointers will be highly appreciated.