Custom DataLoader/Dataset to load several samples at once

Hi,
I’m new to PyTorch. I am implementing and testing a new paper called Sound of Pixels.

In short, it’s a two-tower network: one tower is fed with a stack of images and the other one with audio spectrograms.
I’m using a private dataset in which each sample is a NumPy binary file containing a Python dictionary with both audio and images.

I’ve created my own Dataset subclass that lists all the files inside a folder and uses this list in the
__getitem__ function.
The problem is that I have to combine N samples (since the net is trained by summing audio samples). As far as I understand, I would have to combine these N samples inside the __getitem__ function. However, this function is designed to deal with a single sample and pass it to the DataLoader.

I considered post-processing the batch to do what I need, or writing my own DataLoader using one of the implemented samplers. Before doing that, I would like to know if there is a more efficient way of dealing with this in PyTorch, because I have to compute Fourier transforms of the audio and I don’t want to bottleneck the training with bad data preprocessing.
Thank you very much

Dataset class designed for just one sample

import os

import numpy as np
import torch
from PIL import Image

# data2dic and Sound_standarize are custom helpers defined elsewhere


class BinaryData(torch.utils.data.Dataset):

    def __init__(self, root_dir, transform):
        self.input_list = []
        self.transform = transform
        # collect every file found under root_dir
        for path, subdirs, files in os.walk(root_dir):
            for name in files:
                self.input_list.append(os.path.join(path, name))

    def __len__(self):
        return len(self.input_list)

    def __getitem__(self, idx):
        # load one binary file and unpack the stored dictionary
        dic = data2dic(np.load(self.input_list[idx]))
        audio = Sound_standarize(dic['audio'])
        frames = dic['frames']
        size = np.shape(frames)
        images = []
        # transform every frame and stack them into a single tensor
        for i in range(size[3]):
            images.append(self.transform(Image.fromarray(frames[:, :, :, i])))
        frames = torch.stack(images)
        return audio, frames

Hi, this isn’t really a PyTorch framework problem.
If I understood correctly, you want to speed up your preprocessing during data loading?
I suggest you use a multi-threaded approach.

http://chriskiehl.com/article/parallelism-in-one-line/
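
For example, a rough sketch of the thread-pool idea (load_and_preprocess is just a placeholder for your own file loading and Fourier transform, and the paths are hypothetical):

from multiprocessing.dummy import Pool  # thread-based pool

import numpy as np


def load_and_preprocess(path):
    # placeholder: load one file and compute a spectrogram-like feature
    data = np.load(path)
    return np.abs(np.fft.rfft(data))


file_paths = ['sample_0.npy', 'sample_1.npy']  # hypothetical paths
with Pool(4) as pool:
    features = pool.map(load_and_preprocess, file_paths)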

Well, it’s not a matter of post-processing or speeding things up. In simple words:

I have e.g. 1000 files (the dataset).
I want my DataLoader to open, say, 2 of these files (randomly) and sum them. This sum is what I will feed the net with, so in the end my net will perceive there are 1000/2 = 500 samples. A batch has to be made of sums (not of raw files).

The thing is, I don’t know what the proper way to do this is in PyTorch. I can roughly emulate this behavior, but I would like to know if the PyTorch framework is prepared to do this in an easy way.

I’m not sure if this covers your whole use case, but you could just get the desired number of samples from your dataset and process them in __getitem__.
Here is a small example:

import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self, nb_samples):
        self.images = torch.randn(100, 3, 24, 24)
        self.audio = torch.randn(100, 1024, 12)
        self.nb_samples = nb_samples

    def __getitem__(self, index):
        # Load all nb_samples belonging to this index
        images = self.images[index*self.nb_samples:(index+1)*self.nb_samples]
        audio_specs = self.audio[index*self.nb_samples:(index+1)*self.nb_samples]

        # Transform

        # Sum the samples (keep the per-sample shape by summing over dim 0)
        x1 = torch.sum(images, dim=0)
        x2 = torch.sum(audio_specs, dim=0)

        return x1, x2

    def __len__(self):
        return len(self.images) // self.nb_samples


dataset = MyDataset(2)
img_sum, audio_sum = dataset[0]

Currently adjacent samples are used. Let me know if you need another strategy.
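
If you would rather pick random pairs (as described in your post), a rough sketch wrapping an existing dataset could look like this; it assumes the wrapped dataset (e.g. your BinaryData) returns an (audio, frames) pair with matching shapes per index:

import random

import torch
from torch.utils.data import Dataset


class RandomPairDataset(Dataset):
    def __init__(self, base_dataset):
        self.base = base_dataset

    def __len__(self):
        # one mixed sample per pair of underlying samples
        return len(self.base) // 2

    def __getitem__(self, index):
        # draw two distinct random indices, independent of `index`
        i, j = random.sample(range(len(self.base)), 2)
        audio_i, frames_i = self.base[i]
        audio_j, frames_j = self.base[j]
        mixed_audio = audio_i + audio_j              # feed the net with the sum
        frames = torch.stack([frames_i, frames_j])   # keep both frame stacks
        return mixed_audio, frames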


Oh, in fact it seems perfect. Thank you very much!

@ptrblck, thank you for the example. My dataset is “endless” and stored externally, so it’s not feasible to keep it in the class instance. Currently __getitem__() retrieves just one sample from the external database.

I’m concerned about the high overhead of this sample-by-sample querying; for a batch size of e.g. 64, __getitem__() gets called 64 times consecutively, doesn’t it?

I thought I could build a cache in MyDataset, fetching and storing bunches of samples in the class instance before returning them individually via __getitem__(). Is there a better way / better place to implement it?

Yes, you are right.
If you are concerned about the overhead of the (DB) query, you could create the whole batch directly in __getitem__ (with a single query) and set batch_size=1 in your DataLoader.
Would this work for you?
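
A rough sketch of this pattern (random tensors stand in for the actual DB query):

import torch
from torch.utils.data import Dataset, DataLoader


class BatchedDataset(Dataset):
    def __init__(self, batch_size, num_batches):
        self.batch_size = batch_size
        self.num_batches = num_batches

    def __len__(self):
        return self.num_batches

    def __getitem__(self, index):
        # stand-in for a single query returning batch_size samples at once
        images = torch.randn(self.batch_size, 1, 32, 32)
        targets = torch.randint(0, 10, (self.batch_size,))
        return images, targets


# batch_size=1, since each item already is a full batch; the DataLoader still
# prepends a singleton dimension, which is squeezed away below
loader = DataLoader(BatchedDataset(batch_size=64, num_batches=100), batch_size=1)
for images, targets in loader:
    images, targets = images.squeeze(0), targets.squeeze(0)
    print(images.shape, targets.shape)  # torch.Size([64, 1, 32, 32]) torch.Size([64])
    break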


Sure. Thanks! I just didn’t dare to, wondering instead why I needed to torch.unsqueeze() single samples into tensors with a singleton outer dimension before returning them from __getitem__().

Without unsqueeze(), I was getting the famous RuntimeError: Expected 4-dimensional input for 4-dimensional weight [4, 1, 3, 3], but got 3-dimensional input of size [64, 32, 32] instead from the very first layer, a conv2d (1 input channel to 4 features, 3x3 kernel). The input is a 32x32 grayscale image and batch_size is 64.

This should usually not be necessary since the DataLoader should take care of this.
Did you get an error if you didn’t unsqueeze the data sample?
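
For illustration, a small sketch assuming the 32x32 grayscale inputs you describe: the sample returned from __getitem__ should already contain the channel dimension (shape [1, 32, 32]); the DataLoader then stacks the samples into the 4-dimensional batch the conv layer expects:

import torch
from torch.utils.data import Dataset, DataLoader


class GrayImages(Dataset):
    def __init__(self):
        self.data = torch.randn(256, 1, 32, 32)    # [N, C, H, W] with C=1

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]                    # single sample of shape [1, 32, 32]


loader = DataLoader(GrayImages(), batch_size=64)
batch = next(iter(loader))
print(batch.shape)  # torch.Size([64, 1, 32, 32]) -> valid input for nn.Conv2d(1, 4, 3)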

Thank you for this nice answer! Is there a way to do the same when you do not have self.images, but instead load a random sequence of images from a folder? Before, I used Image.open() from PIL to open a single image in __getitem__. Now I want to do the same but for a sequence of images (let’s say 5 images) each time, so the shape of the batch would be (8, 5). Loading the images in a loop with Image.open() seems like a suboptimal solution.

I’m not aware of a “batched” open method for multiple image files, so I think you would need to use the loop. However, note that you could speed up the image loading by replacing PIL with PIL-SIMD. You would still need to execute the loading in a loop, but it might lower the duration.
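
A minimal sketch of the loop-based loading (load_sequence is just a hypothetical helper; it assumes a list of file paths and a transform that returns tensors of equal shape, e.g. one ending in ToTensor()):

from PIL import Image
import torch


def load_sequence(paths, transform):
    # open each image file with PIL, apply the transform, and stack the results
    imgs = [transform(Image.open(p).convert('RGB')) for p in paths]
    return torch.stack(imgs)   # shape [seq_len, C, H, W]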

I will check the performance of PIL-SIMD. For now, I solved the problem by using imread_collection from skimage.io. In principle, everything works properly, but I’m facing the problem of a slow DataLoader. I use the following code:

    def __getitem__(self, idx):
        idx_q = int(torch.randint(0 + self.boundary, self.length - self.boundary, (1,))) 
        idx_n = int(torch.randint(0 + self.boundary, self.length - self.boundary, (1,)))

        q = imread_collection([self.image_paths[idx_q-1], self.image_paths[idx_q], self.image_paths[idx_q+1]], conserve_memory=True)
        p = imread_collection([self.image_paths_winter[idx_q-1], self.image_paths_winter[idx_q], self.image_paths_winter[idx_q+1]], conserve_memory=True)
        n = imread_collection([self.image_paths_winter[idx_n-1], self.image_paths_winter[idx_n], self.image_paths_winter[idx_n+1]], conserve_memory=True)
        
        if self.transform:
            q = torch.stack([self.transform(img) for img in q])
            p = torch.stack([self.transform(img) for img in p])
            n = torch.stack([self.transform(img) for img in n])

        return q, p, n

So, for batch_size=4 I load 4x3x3 images, which is 36 images. When I run all this __getitem__ code outside of torch’s DataLoader, it takes around 0.3 s. But with the DataLoader it takes 13 seconds, which is a huge difference. Of course, num_workers helps here when there are more iterations, but I don’t know where the bottleneck is. If you have any thoughts, that would be very helpful.

How large is your batch size when you are using the DataLoader?
Each worker of the DataLoader will create a full batch before the DataLoader loop is executed, so in particular the first iteration might be slow, since all workers would start collecting all samples. Once the workers are done, the ready batches will be added to a queue and the workers will start creating the next batch while the training loop is executed, which might reduce the data loading time assuming that the actual model training is not tiny compared to the data loading.

I checked with batch size = 4. If I run this code:

import time

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

batch_size = 4
trainloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, num_workers=4)
dataiter = iter(trainloader)
for i in range(8):
    start = time.time()
    q, p, n = next(dataiter)                      # fetch the next batch
    q, p, n = q.to(device), p.to(device), n.to(device)
    end = time.time()
    print(end - start)

I get the following timing results:

10.58901309967041
1.8810162544250488
0.009713411331176758
0.7070198059082031
10.15380048751831
2.085867404937744
0.009247779846191406
0.009662628173828125

I mean, this is OK compared to one worker, but I still don’t know what is happening inside the loader so that it needs 10 seconds instead of 0.3. Maybe it is because the workers are collecting all samples, as you explained, but with one worker the problem remains.

Btw for batch size 8:

23.229380130767822
0.024968624114990234
0.023113250732421875
1.4580109119415283
24.699864387512207
0.018816709518432617
0.01724529266357422
0.017206192016601562

So it grows linearly

The periodic slowdown every num_workers iterations points towards the aforementioned loading of a full batch in each worker.
In your current script you don’t have any workload, so it’s expected that you would see slower iterations once the queue is empty.

The timings would refer to:

10.58901309967041 # no batches ready in the queue, all 4 workers are preloading
1.8810162544250488 # Grab next batch from queue. Since this is slower than the next step the queue might not have been filled yet.
0.009713411331176758 # grab next batch from queue
0.7070198059082031 # grab next batch from queue
10.15380048751831 # Queue is empty, as no worker was able to load a full batch yet

If your model training is sufficiently large, the DataLoader might be able to preload the next batch(es) while your model is training.
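
As a quick check, you could add a simulated workload to your timing loop (a sketch reusing trainloader and device from your snippet); with enough work per iteration, the time spent waiting for data should shrink after the first iterations:

import time

end = time.time()
for i, (q, p, n) in enumerate(trainloader):
    data_time = time.time() - end               # time spent waiting for the batch
    q, p, n = q.to(device), p.to(device), n.to(device)
    time.sleep(2.0)                             # simulated training step (forward/backward)
    print(f'iter {i}: waited {data_time:.3f}s for data')
    end = time.time()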


I see. Thank you for the help!