How to extract features from a given video using I3D?

grizzlycoder · October 14, 2020, 1:49pm

I’m trying to extract features using a pretrained I3D model available in this repo: https://github.com/piergiaj/pytorch-i3d

I don’t have the Charades dataset, and since I’m running my code in Colab, its 76 GB size stops me from using it directly.

Instead, I would like to take a random video -> apply I3D -> extract features -> show classification.

Essentially, I want to do something like this: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/action_recognition_with_tf_hub.ipynb but for PyTorch.

Can someone explain how I would go about running extract_features.py on my own video (i.e. how I should update the Dataset module)? Also, since I want to start from a raw video, I am a bit unsure how to convert it into RGB frames / optical flow frames.

Thank you!

a_d · October 14, 2020, 3:15pm

I generally use the following dataset class for my video datasets. It reads each video one frame at a time, stacks the frames and returns a NumPy array of shape (channels, num_frames, height, width), which the default DataLoader collation turns into a tensor.
Here is my implementation of the class…

import os

import cv2
import numpy
from torch.utils.data import Dataset


class customVideoDataset(Dataset):
    def __init__(self, path, frame_count):
        # Expects path/<class_name>/<video_file>; the folder name is the label.
        self.videos = []
        self.labels = []
        self.frames = frame_count  # number of frames read from each video
        for label in sorted(os.listdir(path)):
            for fname in sorted(os.listdir(os.path.join(path, label))):
                self.videos.append(os.path.join(path, label, fname))
                self.labels.append(label)

        self.label2index = {label: index for index, label in enumerate(sorted(set(self.labels)))}
        self.label_array = numpy.array([self.label2index[label] for label in self.labels], dtype=int)

    def __getitem__(self, idx):
        video = cv2.VideoCapture(self.videos[idx])
        # Zero-initialised so that videos shorter than self.frames are padded
        # with black frames instead of uninitialised memory.
        # Layout while reading: (frames, height, width, channels).
        stacked_frames = numpy.zeros(shape=(self.frames, 32, 32, 3), dtype=numpy.float32)
        frame_count = 0
        while video.isOpened() and frame_count < self.frames:
            ret, frame = video.read()
            if not ret:
                break
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes frames as BGR
            frame = cv2.resize(frame, (32, 32))  # pretrained I3D usually wants larger frames, e.g. 224x224
            stacked_frames[frame_count] = frame
            frame_count += 1
        video.release()
        # Reorder to (channels, frames, height, width), the layout 3D conv models expect.
        stacked_frames = stacked_frames.transpose((3, 0, 1, 2))

        return stacked_frames, self.label_array[idx]

    def __len__(self):
        return len(self.videos)
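
For the pretrained I3D weights specifically, the frames probably need a little more preprocessing than the class above does. Here is a minimal sketch, assuming the model wants RGB input scaled to [-1, 1] in (channels, frames, height, width) order, which as far as I can tell is what the data loaders in the pytorch-i3d repo do. The helper name prepare_clip and the random dummy clip are only for illustration; in practice the frames would come from cv2 as in the class above.

```python
import numpy as np


def prepare_clip(frames):
    """Convert RGB frames of shape (T, H, W, 3), values 0..255,
    into a float32 array of shape (3, T, H, W) scaled to [-1, 1]."""
    clip = np.asarray(frames, dtype=np.float32)
    clip = clip / 255.0 * 2.0 - 1.0          # 0..255 -> -1..1 (assumed I3D convention)
    clip = clip.transpose(3, 0, 1, 2)        # (T, H, W, C) -> (C, T, H, W)
    return np.ascontiguousarray(clip)


# Dummy data standing in for 16 decoded 224x224 video frames.
dummy_frames = np.random.randint(0, 256, size=(16, 224, 224, 3), dtype=np.uint8)
clip = prepare_clip(dummy_frames)
batch = clip[np.newaxis]                     # add a batch dimension: (1, 3, 16, 224, 224)
print(clip.shape, batch.shape, clip.dtype)
```

The batch can then be wrapped with torch.from_numpy(batch) before it is passed to the model.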

Hope it helps…

grizzlycoder · October 14, 2020, 4:02pm

Thank you so much! How are your videos stored (directory- and naming-wise)?

a_d · October 14, 2020, 5:15pm

Well, if it is for classification, it is root_dir/class1, root_dir/class2 and so on, with each class folder holding the videos for that class.
I generally leave it in the default structure, i.e. the way it is once you download and extract.
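
To make the layout concrete, here is a rough sketch of how the labels fall out of the folder names. The function name index_videos and the class and file names (archery, bowling, a1.mp4, ...) are made up for the example; it builds a throwaway directory tree with empty files so it runs without any real videos.

```python
import os
import tempfile


def index_videos(root_dir):
    """Walk root_dir/<class_name>/<video_file> and return
    (video paths, label per video, label -> integer index)."""
    paths, labels = [], []
    for label in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, label)
        if not os.path.isdir(class_dir):
            continue  # skip stray files such as a README next to the class folders
        for fname in sorted(os.listdir(class_dir)):
            paths.append(os.path.join(class_dir, fname))
            labels.append(label)
    label2index = {label: i for i, label in enumerate(sorted(set(labels)))}
    return paths, labels, label2index


# Build a temporary tree with empty placeholder files, then index it.
with tempfile.TemporaryDirectory() as root:
    for label, fname in [("archery", "a1.mp4"), ("archery", "a2.mp4"), ("bowling", "b1.mp4")]:
        os.makedirs(os.path.join(root, label), exist_ok=True)
        open(os.path.join(root, label, fname), "w").close()
    paths, labels, label2index = index_videos(root)

print(labels)
print(label2index)
```

The file names themselves do not matter for the labels; only the name of the folder a video sits in is used.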

Sevakram_Kumbhare · July 6, 2023, 12:49pm

Are you able to extract the features? If so, can you please share the code snippet?