How to upload a sequence of images for video classification

Let’s focus first on the training folder and apply the same approach to the test and eval folders, if possible.
So in your training folder you have 6 different action folders, which represent different classes.
For example, boxing would be class0, and jogging class1.
In each of these “action (class) folders” you have 25 subfolders with frames from different persons performing the current action.
As far as I understand you don’t want to mix up the frames of different persons, i.e. you would like to get sequences of a single action from a single person. The next sequence might have another single action from another single person.
Is this correct?

EDIT: Do you want each sequence to have the same length, e.g. 10 images?
If so, do you want a sliding window approach, i.e.:

batch0: box_person0_image0, box_person0_image1, box_person0_image2, ... box_person0_image9
batch1: box_person0_image1, box_person0_image2, box_person0_image3, ... box_person0_image10

or rather:

batch0: box_person0_image0, box_person0_image1, box_person0_image2, ... box_person0_image9
batch1: box_person0_image10, box_person0_image11, box_person0_image12, ... box_person0_image19
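
Expressed as window start indices (a hypothetical illustration, not part of the original question), the two schemes differ only in the stride:

import torch

num_frames = 20   # assumed frame count for one person
seq_length = 10

# sliding window: stride 1, consecutive windows overlap by seq_length - 1 frames
sliding_starts = torch.arange(0, num_frames - seq_length + 1)
# neighboring windows: stride seq_length, windows are disjoint
neighbor_starts = torch.arange(0, num_frames - seq_length + 1, seq_length)

print(sliding_starts.tolist())   # [0, 1, 2, ..., 10]
print(neighbor_starts.tolist())  # [0, 10]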

In the folder of, for example, ‘boxing’ I have folders for different people performing the same boxing action. In each person’s folder I have different frames, but in sequence.
I don’t know how to load my data in dataset.py (my Dataset class). Could you show me an example?
I know that I should load the images sequentially, but how can I load them in blocks of images?


Is that correct?

To create a code sample I would need to know some more information mentioned in my last post.

Also do you want the sliding window approach or neighboring windows?

I’m sorry, I explained myself badly. I’m only interested in associating each frame sequence with the correct label, like ‘running’ etc.
I don’t care if individual people are recognized. I just need the person folders so that each block of frames stays together.
I’m only interested in processing frames in sequence to allow the algorithm to recognize, for example, the ‘running’ action.
I think a sliding window approach is better.

Assuming your folder structure looks like this:

root/
    - boxing/
        - person0/
            - image00.png
            - image01.png
            - ...
        - person1/
            - image00.png
            - image01.png
            - ...
    - jogging/
        - person0/
            - image00.png
            - image01.png
            - ...
        - person1/
            - image00.png
            - image01.png
            - ...

You could first get all image paths and the corresponding target.
Then we would have to take care of the invalid indices, i.e. windows spanning images from different persons, as this might be problematic for the training.
Using a sampler, we can get all valid indices for the current sequence length.
Here is a code sample I adapted to your use case:


import os
import glob

import torch
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms

from PIL import Image


class MySampler(torch.utils.data.Sampler):
    def __init__(self, end_idx, seq_length):
        # end_idx holds the cumulative frame counts, i.e. the boundaries
        # between the videos; collect only those start indices whose whole
        # window of seq_length frames stays inside a single video
        indices = []
        for i in range(len(end_idx) - 1):
            start = end_idx[i]
            end = end_idx[i + 1] - seq_length
            indices.append(torch.arange(start, end))
        indices = torch.cat(indices)
        self.indices = indices

    def __iter__(self):
        # shuffle the valid start indices in each epoch
        indices = self.indices[torch.randperm(len(self.indices))]
        return iter(indices.tolist())

    def __len__(self):
        return len(self.indices)


class MyDataset(Dataset):
    def __init__(self, image_paths, seq_length, transform, length):
        self.image_paths = image_paths
        self.seq_length = seq_length
        self.transform = transform
        self.length = length

    def __getitem__(self, index):
        # index is a valid window start produced by the sampler;
        # load the seq_length consecutive frames starting there
        start = index
        end = index + self.seq_length
        print('Getting images from {} to {}'.format(start, end))
        indices = list(range(start, end))
        images = []
        for i in indices:
            image_path = self.image_paths[i][0]
            image = Image.open(image_path)
            if self.transform:
                image = self.transform(image)
            images.append(image)
        x = torch.stack(images)
        # all frames in the window share the class of the first one
        y = torch.tensor([self.image_paths[start][1]], dtype=torch.long)

        return x, y

    def __len__(self):
        return self.length


root_dir = './video_data_test/'
class_paths = [d.path for d in os.scandir(root_dir) if d.is_dir()]

class_image_paths = []
end_idx = []
for c, class_path in enumerate(class_paths):
    for d in os.scandir(class_path):
        if d.is_dir():
            paths = sorted(glob.glob(os.path.join(d.path, '*.png')))
            # Add class idx to paths
            paths = [(p, c) for p in paths]
            class_image_paths.extend(paths)
            end_idx.append(len(paths))

# prepend 0 and compute the cumulative frame counts, i.e. the video boundaries
end_idx = [0, *end_idx]
end_idx = torch.cumsum(torch.tensor(end_idx), 0)
seq_length = 10

sampler = MySampler(end_idx, seq_length)
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor()
])

dataset = MyDataset(
    image_paths=class_image_paths,
    seq_length=seq_length,
    transform=transform,
    length=len(sampler))

loader = DataLoader(
    dataset,
    batch_size=1,
    sampler=sampler
)

for data, target in loader:
    print(data.shape)

If you use the Dataset without the provided sampler, you will get invalid sequences, e.g. one part might come from person0 while the other from person1.


Would you please provide the code with neighboring windows?
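
(The quoted thread doesn’t include an answer here; a minimal sketch, assuming the same end_idx bookkeeping as MySampler above, only needs a larger step in the arange so consecutive windows don’t overlap:)

class NeighboringSampler(torch.utils.data.Sampler):
    # like MySampler, but yields disjoint (non-overlapping) windows
    def __init__(self, end_idx, seq_length):
        indices = []
        for i in range(len(end_idx) - 1):
            start = end_idx[i]
            end = end_idx[i + 1] - seq_length + 1
            # step by seq_length so consecutive windows don't overlap
            indices.append(torch.arange(start, end, seq_length))
        self.indices = torch.cat(indices)

    def __iter__(self):
        indices = self.indices[torch.randperm(len(self.indices))]
        return iter(indices.tolist())

    def __len__(self):
        return len(self.indices)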

Say my sequence length is 10, and say one of my videos is 55 frames long. I’d get 5 batches of 10 frames each, but how are the last 5 frames handled?

Also, can someone point me to a detailed explanation of samplers?

Note that the posted code was for a very specific use case, where different actions as well as persons were used.
Are you using a similar workflow, or how is your data structured?

Yeah, a very similar workflow. I’m using a CNN + LSTM architecture with this kind of file structure. This dataloader does get me frames in sequence, but my doubt was, as I explained: what if I have 33 frames and my sequence length is 10? This sequence length yields three sets of 10 frames each, but are the last three frames discarded or repeated?

Also, can you please point me towards a more detailed and up-to-date explanation of the sampler.

train/
    - action1/
        - vid1/
            - image00.png
            - image01.png
            - ...
        - vid2/
            - image00.png
            - image01.png
            - ...
    - action2/
        - vid1/
            - image00.png
            - image01.png
            - ...
        - vid2/
            - image00.png
            - image01.png
            - ...
val/
    - action1/
        - vid1/
            - image00.png
            - image01.png
            - ...
        - vid2/
            - image00.png
            - image01.png
            - ...
    - action2/
        - vid1/
            - image00.png
            - image01.png
            - ...
        - vid2/
            - image00.png
            - image01.png
            - ...

Is it because we are discarding a few indices when we do

for i in range(len(end_idx)-1):
    start = end_idx[i]
    end = end_idx[i+1] - seq_length
    print(start, end)
    indices.append(torch.arange(start, end))
indices = torch.cat(indices)
self.indices = indices

since if I have 55 frames in action1 → vid1, then end_idx would be [0, 55, …], so when appending indices we append everything in [0, 45) via indices.append(torch.arange(start, end)), because my end is 55 - seq_length = 45. I’m thus discarding the last few start positions, and my indices would only contain [0, 1, …, 44, 55, 56, …]

But what happens when __getitem__ gets an index of 44? Then list(range(start, end)) would give [44, 45, 46, 47, 48, 49, 50, 51, 52, 53], but these indices are not all in self.indices in the sampler. With image_path = self.image_paths[i][0], wouldn’t we get images from the next video, because 45 here means the 46th entry in image_paths?

def __getitem__(self, index):
    start = index
    end = index + self.seq_length
    print('Getting images from {} to {}'.format(start, end))
    indices = list(range(start, end))
    images = []
    for i in indices:
        image_path = self.image_paths[i][0]
        image = Image.open(image_path)
        if self.transform:
            image = self.transform(image)
        images.append(image)
    x = torch.stack(images)
    y = torch.tensor([self.image_paths[start][1]], dtype=torch.long)

    return x, y

I got it, sorry. Since we didn’t append 45-54 to our indices, an index of 44 gives list(range(44, 54)) = [44, 45, …, 53], and these are all still valid frames of the same video.
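
To make that arithmetic concrete, a small check with the hypothetical 55-frame video from the example above:

import torch

seq_length = 10
end_idx = torch.tensor([0, 55])   # one video with 55 frames
starts = torch.arange(end_idx[0], end_idx[1] - seq_length)
print(starts[0].item(), starts[-1].item())   # 0 44
# the last sampled window starts at 44 and covers frames 44..53,
# which all belong to the same 55-frame video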

Hi, I have to give a set of 8 video frames as input to a deep learning model. Can I do this in a similar way? The output of the 1st convolutional layer has to be of size (8 * [11211264]). Can you help me? I am new to this field.

This will help you load multiple video frames at once.
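
For instance (a sketch assuming the same class_image_paths, end_idx and transform setup as in the code above), you would only change the sequence length:

seq_length = 8
sampler = MySampler(end_idx, seq_length)
dataset = MyDataset(
    image_paths=class_image_paths,
    seq_length=seq_length,
    transform=transform,
    length=len(sampler))
loader = DataLoader(dataset, batch_size=1, sampler=sampler)

for data, target in loader:
    # with the Resize((32, 32)) transform above, data has shape [1, 8, 3, 32, 32]
    print(data.shape)
    break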

Hey @ptrblck_de

Thanks for your help in most of the questions that I’ve read while trying to solve my problem over the past week.

I have a dataset of medical images that are extremely large (each of them almost [8900 * 8700 * 3]), therefore we split them manually into smaller patches (~ [300 * 300 * 3]) with respect to some meaningful medical properties.

Now I have a separate folder for each patient that contains the patches split from the original image (folder 1 [image 0, image 1, …], folder 2 [image 0, image 1, …]).

Now I want to build a data loader and then pass my input to a CNN, but I got confused after reading a lot of questions here.

Would you please help me with these questions?

  1. Should I use ImageFolder or an iterative DataLoader like you wrote here? Can your above code load my data?
  2. My patches are not all exactly the same size; should I transform them after loading, in the __init__ function?

Many thanks for your help in advance.

  1. I would probably not use the ImageFolder dataset, as it would assign a new class label to each subfolder. If I understand your use case correctly, the folders contain the image patches, so you should implement the loading logic in a custom Dataset instead (see the sketch below).

  2. You could keep these patches in the current shape and transform them in the __getitem__ of your custom Dataset. Of course, if you could recreate these patches in a constant shape, you could save the processing time during loading, but it depends on your actual use case and if you want to rerun the offline data creation step.
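
If it helps, here is a minimal sketch of such a custom Dataset; the folder layout and names (./patients/, *.png) are assumptions on my side, and the Resize in the transform addresses point 2 by making differently sized patches batchable:

import os
import glob

from torch.utils.data import Dataset
import torchvision.transforms as transforms
from PIL import Image


class PatchDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        # collect (patch_path, patient_id) pairs; patient_id is just the
        # index of the patient folder here, not a class label
        self.samples = []
        patient_dirs = sorted(os.scandir(root_dir), key=lambda e: e.name)
        for patient_id, d in enumerate(patient_dirs):
            if d.is_dir():
                for p in sorted(glob.glob(os.path.join(d.path, '*.png'))):
                    self.samples.append((p, patient_id))
        self.transform = transform

    def __getitem__(self, index):
        path, patient_id = self.samples[index]
        image = Image.open(path)
        if self.transform:
            image = self.transform(image)
        return image, patient_id

    def __len__(self):
        return len(self.samples)


# resize in the transform so differently sized patches can be batched
transform = transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.ToTensor()
])
dataset = PatchDataset('./patients/', transform=transform)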

So you think that I should define an iterative data loader?
How can I reconstruct the meaningful images from the input patches?
How do I separate the patches of different images from each other, e.g. the patches of image 1 from those of image 2, and so on?

Yes, I think the cleanest way would be to define a custom Dataset.

I’m not sure what your use case is, but I understood that you’ve already split the input images into patches. If that’s the case, you could create a mapping, store the prediction for each patch, and “combine” them afterwards. E.g. if you are working on a classification use case, you could use majority voting etc.
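
As a sketch of the “combine afterwards” step for classification (the tensors below are hypothetical placeholders):

import torch

# hypothetical per-patch logits for one patient, shape [num_patches, num_classes]
patch_logits = torch.randn(12, 4)
patch_preds = patch_logits.argmax(dim=1)   # predicted class per patch

# majority vote: the most frequent patch prediction wins
image_pred = torch.mode(patch_preds).values
print(image_pred.item())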

Hi @ptrblck. First of all I would like to thank you for your very functional code (MyDataset); however, I notice that it is slow and makes training very slow. Indeed, here are the times for one pass through the network and for one pass through MyDataset (4 passes for a batch of 4).

Time of one pass through the network on GPU (with update):
0.025026798248291016s

Time of each pass through MyDataset, so 4 passes for batch=4 (sequence length = 32):

0.6877198219299316s
0.6699411869049072s
0.6670119762420654s
0.6949927806854248s

To obtain those times I have already made a modification, basically transforming the for loop into a list comprehension:
images = [self.transform(Image.open(self.image_paths[i][0])) for i in indices]

How do you speed up your MyDataset code? Thanks!
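
(Not an answer from the original thread, but a common first step is to let the DataLoader decode and transform the images in background worker processes, so loading overlaps with the GPU pass:)

loader = DataLoader(
    dataset,
    batch_size=4,
    sampler=sampler,
    num_workers=4,     # decode and transform sequences in background processes
    pin_memory=True    # speeds up host-to-GPU transfers
)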

How can I modify this code if my sequences don’t all have the same length?
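
(This question goes unanswered in the quoted thread; one common approach, which is my assumption rather than the original author’s solution, is to pad each batch to its longest sequence in a custom collate_fn and pass the true lengths along:)

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_fn(batch):
    # batch is a list of (sequence, label) pairs where the sequences
    # may contain different numbers of frames
    sequences, labels = zip(*batch)
    lengths = torch.tensor([seq.size(0) for seq in sequences])
    # pad to the longest sequence in the batch: [batch, max_len, C, H, W]
    padded = pad_sequence(sequences, batch_first=True)
    return padded, torch.stack(labels), lengths

loader = DataLoader(dataset, batch_size=4, collate_fn=collate_fn)
# the returned lengths could be passed to pack_padded_sequence for an LSTM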