I am referring to eriklindernoren/Action-Recognition. In the UCF101 dataset, each video has exactly one action label, so the following model output works fine.
# output_layers of models.py
self.output_layers = nn.Sequential(
    nn.Linear(2 * hidden_dim if bidirectional else hidden_dim, hidden_dim),
    nn.BatchNorm1d(hidden_dim, momentum=0.01),
    nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),
    nn.Softmax(dim=-1),
)
Now I have data in which a single video contains multiple actions. How can I train on a dataset that has multiple action labels per frame of one video? I understand that I need to extend the output of the model above by one dimension. If you know of any existing implementations or repositories that might help, please let me know.
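For reference, here is one common way to adapt such a head to multi-label, per-frame prediction (this is my own sketch, not code from the repository): drop the Softmax, which assumes mutually exclusive classes, emit one independent logit per class per frame, and train with BCEWithLogitsLoss against multi-hot targets. The hidden_dim, num_classes, and batch/sequence sizes below are placeholder values; I also drop the BatchNorm1d for simplicity, since it expects (N, C) or (N, C, L) input and would need a reshape for (batch, time, hidden) tensors.

```python
import torch
import torch.nn as nn

# Placeholder hyperparameters (assumed, not from the repo)
hidden_dim = 256
num_classes = 10
bidirectional = True

# Multi-label, per-frame head: one independent logit per class.
# No Softmax; BCEWithLogitsLoss applies the sigmoid internally.
output_layers = nn.Sequential(
    nn.Linear(2 * hidden_dim if bidirectional else hidden_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, num_classes),
)

criterion = nn.BCEWithLogitsLoss()

# Simulated recurrent features: (batch, time, features).
# nn.Linear operates on the last dimension, so per-frame logits
# come out as (batch, time, num_classes).
x = torch.randn(4, 16, 2 * hidden_dim)
logits = output_layers(x)

# Multi-hot target: each frame may have several active classes.
targets = torch.randint(0, 2, logits.shape).float()
loss = criterion(logits, targets)
```

At inference time you would apply `torch.sigmoid(logits)` and threshold (e.g. at 0.5) to decide which actions are active in each frame.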