How can I train frame-by-frame action recognition on video using PyTorch?

I am currently referring to eriklindernoren/Action-Recognition. The UCF101 dataset has one action label per video, so the following model output works fine:

        # output_layers of models.py
        self.output_layers = nn.Sequential(
            nn.Linear(2 * hidden_dim if bidirectional else hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim, momentum=0.01),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
            nn.Softmax(dim=-1),
        )
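For context, my understanding of the per-video setup (a hedged sketch, not verbatim from the repository; the feature size, batch shapes, and the way the sequence is reduced to one vector are my assumptions) is roughly:

```python
import torch
import torch.nn as nn

hidden_dim, num_classes = 128, 101  # UCF101 has 101 classes

# Sketch: an LSTM over per-frame CNN features, then the head quoted above.
lstm = nn.LSTM(input_size=512, hidden_size=hidden_dim, batch_first=True)
output_layers = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim),
    nn.BatchNorm1d(hidden_dim, momentum=0.01),
    nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),
    nn.Softmax(dim=-1),
)

features = torch.randn(4, 40, 512)   # (batch, frames, CNN feature dim) -- assumed shapes
seq_out, _ = lstm(features)          # (batch, frames, hidden_dim)
video_vec = seq_out[:, -1]           # reduce the sequence to one vector, e.g. last timestep
probs = output_layers(video_vec)     # (batch, num_classes): one prediction per video
```

The key point is that the time dimension is collapsed before the head, which is why a single label per video is enough.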

Now I have data with multiple actions per video. How can I train on a dataset that has an action label for each frame of a single video? I understand that I need to add a time dimension to the output of the model above. Please let me know if you know of any existing implementations or repositories that might be helpful.
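To make the question concrete, here is a minimal sketch of what I imagine the per-frame variant would look like (all names are hypothetical, not from the repository): keep the LSTM's full output sequence and apply the classification head at every timestep, since `nn.Linear` broadcasts over leading dimensions.

```python
import torch
import torch.nn as nn

class FrameLevelHead(nn.Module):
    """Hypothetical sketch: classify every frame instead of the whole clip.

    Assumes an upstream LSTM returning one feature per timestep,
    shape (batch, seq_len, hidden_dim).
    """

    def __init__(self, hidden_dim, num_classes, bidirectional=False):
        super().__init__()
        in_dim = 2 * hidden_dim if bidirectional else hidden_dim
        # Same head as the per-video model, applied at each timestep.
        # BatchNorm1d is dropped here: on a (B, T, H) tensor it would
        # treat T as the channel dimension. No Softmax either, because
        # nn.CrossEntropyLoss expects raw logits.
        self.head = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, lstm_out):        # (B, T, in_dim)
        return self.head(lstm_out)      # (B, T, num_classes)

B, T, H, C = 2, 16, 128, 5             # toy sizes for illustration
head = FrameLevelHead(H, C)
logits = head(torch.randn(B, T, H))    # one prediction per frame
labels = torch.randint(0, C, (B, T))   # one label per frame
# Cross-entropy over frames: merge the batch and time dimensions.
loss = nn.CrossEntropyLoss()(logits.reshape(B * T, C), labels.reshape(B * T))
```

Is this the right direction, or is there a standard approach (e.g. temporal action segmentation models) that handles this better?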