Pass extra arguments to __getitem__

Hi

I have implemented a custom dataset class where images are retrieved based on numerous object attributes. This, unfortunately, doesn’t work when I employ multiple workers. I am not at all certain that this is the right way of implementing this type of functionality. I’d appreciate any help. Here is some pseudocode:

loader = DataLoader(custom_dataset)
for X1, y, indexes in loader:  # where indexes are the dataset indices of the loaded batch
    pass X1 into a model and get some values V1
    update custom_dataset with V1 for the current indexes
    retrieve X2 using __getitem__(indexes)
    pass X2 into a model and get some values V2
    update custom_dataset with V2 for the current indexes
    retrieve X3 using __getitem__(indexes)
    pass X3 into a classifier, get y_hat, compute the loss, and backpropagate

Unfortunately, I cannot preprocess the data a priori and hence there needs to be sequential data retrieval from the same images. If there was some way of passing the values V1, V2, V3 into getitem, I think multiprocessing would work. Let me know if anything is unclear! Thank you!


You could probably create a custom sampler and try to pass more arguments to __getitem__.
However, what changes are you applying to your custom Dataset?
Note that each worker will work on a copy of the Dataset, so manipulating it is not trivial and you might need to use shared arrays/dicts etc.
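To make the "each worker works on a copy" point concrete, here is a small stdlib-only demonstration (not from the thread): an ordinary Python object mutated inside a child process stays unchanged in the parent, while a multiprocessing.Array is genuinely shared. DataLoader workers behave like the child process here.

```python
import multiprocessing as mp
import ctypes

plain = [0.0]                         # ordinary object: each process gets its own copy
shared = mp.Array(ctypes.c_float, 1)  # shared-memory buffer: one copy for all processes

def worker(plain, shared):
    plain[0] = 1.0   # mutates only the child's private copy
    shared[0] = 1.0  # visible to the parent

if __name__ == "__main__":
    p = mp.Process(target=worker, args=(plain, shared))
    p.start()
    p.join()
    print(plain[0])   # still 0.0 in the parent
    print(shared[0])  # 1.0
```

The same asymmetry is why updating a plain attribute on the Dataset appears to work with `num_workers=0` but silently stops working with multiple workers.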

Thanks for replying.

However, what changes are you applying to your custom Dataset?

The images that I am dealing with are too large to process at once and hence each time I can only see a glimpse/patch of the original image. The ‘changes’ are simply the new coordinates of the next glimpse.

Note that each worker will work on a copy of the Dataset, so manipulating it is not trivial and you might need to use shared arrays/dicts etc.

My current implementation assumes that there is only one copy, which is why issues arise when I use multiple workers. It would be really helpful if you could elaborate on how I could implement shared arrays. I will look into the custom sampler and get back to you 🙂 Thanks!
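For reference, a minimal torch-free sketch of the custom-sampler idea: the sampler yields composite indices (sample_idx, glimpse_idx) and __getitem__ unpacks them. The class names are made up for illustration; with PyTorch, they would subclass torch.utils.data.Sampler and torch.utils.data.Dataset, and the DataLoader passes whatever the sampler yields straight into __getitem__.

```python
class GlimpseSampler:
    """Yields (sample_idx, glimpse_idx) pairs instead of plain ints."""
    def __init__(self, num_samples, num_glimpses):
        self.num_samples = num_samples
        self.num_glimpses = num_glimpses

    def __iter__(self):
        for s in range(self.num_samples):
            for g in range(self.num_glimpses):
                yield (s, g)

    def __len__(self):
        return self.num_samples * self.num_glimpses

class FakeDataset:
    """__getitem__ receives whatever index object the sampler yields."""
    def __getitem__(self, index):
        sample_idx, glimpse_idx = index  # unpack the composite index
        return f"image {sample_idx}, glimpse {glimpse_idx}"
```

Note that this only extends what the index can encode; it does not by itself let the training loop feed values back into the Dataset between retrievals.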

You could try to adapt this example, but I think the proper approach would be to create or read the indices either in the sampler or the Dataset without the need to manipulate it. This shared array approach sounds more like a hack, so what exactly are you changing in the dataset?
If you are using two indices, one for the data sample and another one for the coordinates, would it be possible to create a list of indices of these pairs and increase the length of the dataset to sample from them?
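A sketch of this "expand the dataset length" suggestion, assuming the candidate glimpse coordinates were known up front (class and variable names are illustrative, not from the thread): a single flat index encodes the (image, glimpse) pair via divmod, and __len__ grows accordingly.

```python
class GlimpsePairDataset:
    """Maps a flat index to an (image, glimpse-coordinate) pair."""
    def __init__(self, num_images, glimpse_coords):
        # glimpse_coords: list of (x, y) positions, assumed identical per image
        self.num_images = num_images
        self.glimpse_coords = glimpse_coords

    def __len__(self):
        # one entry per (image, glimpse) pair
        return self.num_images * len(self.glimpse_coords)

    def __getitem__(self, flat_idx):
        image_idx, glimpse_idx = divmod(flat_idx, len(self.glimpse_coords))
        x, y = self.glimpse_coords[glimpse_idx]
        # a real implementation would read the patch of image image_idx at (x, y) here
        return image_idx, (x, y)
```

This approach only works when the coordinates can be enumerated in advance, which, as the discussion that follows makes clear, is not the case for model-predicted glimpses.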

You could try to adapt this example

Excellent! Thank you! I will keep you updated.

but I think the proper approach would be to create or read the indices either in the sampler or the Dataset without the need to manipulate it.

For clarity, an index is simply the ID of an image, which I use in conjunction with __getitem__ to retrieve the right image. The issue lies with V{1, 2} as described in my initial pseudocode. If by indices you are referring to this V, then what you suggest is not possible: these values cannot be generated in the Dataset (or sampler), as they are predictions made by a neural network.

If you are using two indices, one for the data sample and another one for the coordinates, would it be possible to create a list of indices of these pairs and increase the length of the dataset to sample from them?

I am not sure I am following 100%, but I will stick to my previous answers and provide more background information. The input images are huge (look up histopathology images), and as such they cannot be pre-processed to accommodate all the possible values of V{1, 2}. Ideally, this should be done dynamically, i.e. only the part of an image that the model wants to observe is accessed. Since I'd like the model to have multiple 'chances' of looking at an image, I need to access each image multiple times, each time looking at a specific part of it as defined by V.

In a simpler world, I would be able to store the image in memory and use indices to directly extract these glimpses, but unfortunately this is not possible with my dataset (the typical compressed size of each image is 3-8 GB). Hence, I use OpenSlide (https://openslide.org/api/python/) to access my images, each time passing coordinates and a magnification level (which is what I've defined as V above) to extract a specific patch of the image dynamically.

OK, I think I understood the use case better now. Thanks for the detailed explanation.
In that case the shared array approach might be the easiest solution.
Let me know how it goes.


Just a follow-up: it worked! I was able to use multiple workers with shared arrays. However, not only was there no improvement in training time, it was actually slightly slower.

The code I used:

import multiprocessing as mp
import ctypes
import numpy as np

def create_shared_np_float(value, shape):
    # Allocate a float32 buffer in shared memory so that every
    # DataLoader worker sees the same underlying storage.
    shared_array_base = mp.Array(ctypes.c_float, int(np.prod(shape)))
    # Wrap the raw buffer as a NumPy array (no copy is made).
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape(*shape)
    shared_array.fill(value)
    return shared_array

def create_shared_np_int(value, shape):
    # Same as above, but backed by a 32-bit integer buffer.
    shared_array_base = mp.Array(ctypes.c_int, int(np.prod(shape)))
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape(*shape)
    shared_array.fill(value)
    return shared_array
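The thread doesn't show how these helpers are wired into the Dataset, so here is a sketch of one way it could look (class name, shapes, and the coordinate layout are illustrative assumptions; with PyTorch, GlimpseDataset would subclass torch.utils.data.Dataset, and the sharing relies on workers inheriting the buffer, e.g. via fork on Linux): the training loop writes predicted coordinates into the shared array, and every worker's __getitem__ reads them, since the shared buffer is the one piece of state not copied per worker.

```python
import multiprocessing as mp
import ctypes
import numpy as np

def create_shared_np_float(value, shape):
    # float32 array backed by shared memory, as in the helper above
    base = mp.Array(ctypes.c_float, int(np.prod(shape)))
    arr = np.ctypeslib.as_array(base.get_obj()).reshape(*shape)
    arr.fill(value)
    return arr

# Shared (x, y) coordinates, one row per image; the training loop
# overwrites a row after each glimpse prediction.
NUM_IMAGES = 4
coords = create_shared_np_float(0.0, (NUM_IMAGES, 2))

class GlimpseDataset:  # with PyTorch: subclass torch.utils.data.Dataset
    def __init__(self, coords):
        self.coords = coords

    def __len__(self):
        return len(self.coords)

    def __getitem__(self, idx):
        x, y = self.coords[idx]
        # a real implementation would crop the patch at (x, y) here,
        # e.g. with OpenSlide's read_region
        return idx, (float(x), float(y))

dataset = GlimpseDataset(coords)
coords[2] = (100.0, 200.0)  # stand-in for a model-predicted coordinate
```

Because the NumPy view wraps the same shared buffer in every process, the write to `coords[2]` is visible to the next `__getitem__(2)` call regardless of which worker serves it.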