Doubt regarding Implementation of Hierarchical Attention Network

Hey all,
I was reading this paper and ran into a problem during implementation.

During dataset creation, I built each batch as a 4-dimensional tensor of shape (batch_size x document_size x sentence_size x embedding_size).

Now, when I pass this tensor through the GRU, I get an error saying that nn.GRU only accepts 3D tensors. So my question is: what changes should I make to the model or to the dataset creation?
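For reference, here is a minimal sketch of one workaround I have seen for this kind of shape mismatch: fold the document dimension into the batch dimension before the word-level GRU, then unfold it again before the sentence-level GRU. The sizes, the names word_gru and sent_gru, and the mean pooling (a stand-in for the attention layers) are my own assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
batch_size, document_size, sentence_size, embedding_size = 2, 3, 5, 8
hidden_size = 4

# 4D batch: (batch_size, document_size, sentence_size, embedding_size)
batch = torch.randn(batch_size, document_size, sentence_size, embedding_size)

# Word-level and sentence-level encoders (names are assumptions).
word_gru = nn.GRU(embedding_size, hidden_size,
                  batch_first=True, bidirectional=True)
sent_gru = nn.GRU(2 * hidden_size, hidden_size,
                  batch_first=True, bidirectional=True)

# Fold documents into the batch dimension so that every sentence
# becomes one 3D sequence: (batch * docs, words, embedding).
words = batch.reshape(batch_size * document_size,
                      sentence_size, embedding_size)
word_out, _ = word_gru(words)       # (batch * docs, words, 2 * hidden)

# Placeholder pooling: a mean over words stands in for word attention.
sent_vecs = word_out.mean(dim=1)    # (batch * docs, 2 * hidden)

# Unfold back to one sequence of sentence vectors per document.
sent_vecs = sent_vecs.reshape(batch_size, document_size, 2 * hidden_size)
sent_out, _ = sent_gru(sent_vecs)   # (batch, docs, 2 * hidden)

# Placeholder pooling again, standing in for sentence attention.
doc_vecs = sent_out.mean(dim=1)     # (batch, 2 * hidden)
```

With this approach the dataset can stay 4D and only the forward pass changes. Padded words and sentences are not masked in this sketch, so that would still need handling.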

Thanks in advance.