Before loading video data, I need to clarify some concepts.
In action recognition, a frame of video has one label,
and in object detection, a frame of video has multiple labels (bboxes)? Is that correct?
It depends on the setup. Sometimes in action recognition the whole video sequence has one label, i.e. multiple frames are associated with a single label.
Usually in object detection there can be multiple boxes in each frame. So yes, that’s correct.
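
As a rough illustration of the two label layouts, here is a minimal sketch in plain Python. The class names (`ActionClip`, `DetectionFrame`), the file names, the box format `(x_min, y_min, x_max, y_max)`, and the label values are all assumptions made up for this example; your dataset may store things differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ActionClip:
    """Hypothetical action recognition sample: many frames, one label."""
    frames: List[str]  # paths to the frames of one clip (made-up names)
    label: int         # a single class index for the whole clip


@dataclass
class DetectionFrame:
    """Hypothetical object detection sample: one frame, many boxes."""
    frame: str
    # One (x_min, y_min, x_max, y_max) tuple per object (assumed format).
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)
    # One class index per box, in the same order as `boxes`.
    labels: List[int] = field(default_factory=list)


# A 16-frame clip that shares one action label.
clip = ActionClip(frames=[f"frame_{i:03d}.jpg" for i in range(16)], label=3)

# A single frame that contains two annotated objects.
det = DetectionFrame(
    frame="frame_000.jpg",
    boxes=[(10.0, 20.0, 50.0, 80.0), (60.0, 30.0, 120.0, 90.0)],
    labels=[1, 7],
)

print(len(clip.frames), clip.label)
print(len(det.boxes), det.labels)
```

If your action recognition setup labels every frame instead, `label` would become a list with one entry per frame.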