Custom batching in DataLoader based on a mask

Hello, I’m new to PyTorch and I apologize if this is a stupid question, but I am really stuck with this problem. I have a Dataset created from NumPy arrays X and y, and I want to create a DataLoader to pass batches of data to my model. I have another NumPy array users, with the same length as X and y, which tells me which data instance comes from which user (think an array like [0, 0, 0, 1, 1, 2, 2, 2,...]).
I want to batch the dataset such that each batch contains all instances from one user. The number of instances per user can vary, so the batch size is not fixed. I cannot figure out how to do this, but I feel like there should be an easy way that I’m just missing (is this what the batch_sampler argument is for?). Can anyone help with some pointers or a toy example?

Edited to clarify: There are multiple (X,y) pairs for each user in the dataset, and I want to assign all (X,y) pairs belonging to user 1 to batch 1, user 2 to batch 2, and so on. Hope that helps!

DataLoader takes a batch_size parameter and returns batches of that size. The exception is the final batch, which can be smaller if the length of your dataset is not divisible by the batch_size.

Perhaps you may want to structure your data differently.

Consider this shape for your batches:
[batch_size, seq_length, input_size]

batch_size = number of people per batch (this is fixed)
seq_length = number of instances from that user
input_size = length of those instances

Depending on what model you are using, you may need to pad your seq_length and input_size so that they are all the same length.

Example:
you have a dataset of people's names
bob
alice
edward

where each row corresponds to one person.

In this case:
input_size: the length of your one-hot encoding for the name
seq_length: the length of the name
batch_size: the number of people you will process at one time. This is fixed.

You did not provide many details, so this is the most I can do to help.

Edit: the example with names I gave did not account for the encoding, which would be necessary.
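
To make that shape concrete, here is a minimal sketch of what I mean, assuming the names are one-hot encoded over the lowercase alphabet and padded with trailing zero rows to the longest name:

import torch

names = ["bob", "alice", "edward"]
alphabet = "abcdefghijklmnopqrstuvwxyz"
max_len = max(len(name) for name in names)  # seq_length = 6

# [batch_size, seq_length, input_size]; shorter names keep trailing zero rows
batch = torch.zeros(len(names), max_len, len(alphabet))
for i, name in enumerate(names):
    for t, ch in enumerate(name):
        batch[i, t, alphabet.index(ch)] = 1.0

print(batch.shape)  # torch.Size([3, 6, 26])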

Thanks @nicofish, my question was probably unclear, sorry about that. What I want to do is to have variable batch sizes in the DataLoader. I want to assign all X,y pairs belonging to user 1 to batch 1, user 2 to batch 2, and so on. The seq_length and input_size are fixed for each X, so that is not an issue.

What is your end goal? When modeling, if you are using PyTorch to handle gradient descent, the model parameters will only be updated once per batch.

When you batch the data this way, each batch contains the time steps for a single person. The model would therefore handle each person individually and then update all of the gradients once per batch, at the end.

What you are doing is the same as using batch size = 1. In other words, your gradients would be updated after every person. This is more computationally costly, and you lose some of the benefits of batching.

See the last image here:

Observe how the training accuracy is more erratic with a lower batch size.

For more help, please provide sample data and explain what you are trying to accomplish.

I am trying to implement multiple instance learning, as described in this paper. Each batch (a user, in my case) still has multiple instances, though the number of instances may differ across users (which means the batch size is not fixed). The gradients would be updated after every person, but each update is based on multiple instances, so the learning should still be stable.

For some minimal sample data, assume the dataset is somewhat like this:

X = [[2,3], [1,1], [4,3], [6,2]]
y = [0, 1, 0, 1]
subjects = [0, 0, 1, 2]

then I want X to be batched based on subjects:

batch_0 = [[2,3], [1,1]]
batch_1 = [[4,3]]
batch_2 = [[6,2]]

In my actual dataset, there are about 15 instances per subject, so the batch size will not be too small.
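
To make the grouping concrete, here is a sketch of the batches I’m after, assuming X, y, and subjects are NumPy arrays (the dict is just one convenient way to hold variable-size batches):

import numpy as np

X = np.array([[2, 3], [1, 1], [4, 3], [6, 2]])
y = np.array([0, 1, 0, 1])
subjects = np.array([0, 0, 1, 2])

# one entry per subject; each value holds that subject's rows of X
batches = {s: X[subjects == s] for s in np.unique(subjects)}
# batches[0] -> [[2, 3], [1, 1]], batches[1] -> [[4, 3]], batches[2] -> [[6, 2]]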
Thank you!

In that case, don’t use DataLoader. Batch the data yourself using a for loop, and then feed the batches into the model one at a time with another loop.

Do something like this (pseudo-code, assuming X, y, and subjects are NumPy arrays):

for i in range(num_epochs):
    for j in np.unique(subjects):
        # indices of all instances belonging to subject j
        index = np.where(subjects == j)[0]

        # your model training step
        data = X[index]
        target = y[index]
        output = model(data)
        # compute your loss
        # do your backpropagation
        # track your training and/or validation loss
    # save the model at epoch i

I think that’s what I’ll fall back to, thanks. I was hoping there would be a more efficient way to do this using DataLoader - assigning individual samples to custom batches based on some criteria feels like it should not be such a unique use case.
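
For reference, this is roughly what I had in mind with the batch_sampler argument (it seems to accept any precomputed list of index lists, though I’m not sure this is the intended use):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

X = np.array([[2, 3], [1, 1], [4, 3], [6, 2]], dtype=np.float32)
y = np.array([0, 1, 0, 1], dtype=np.int64)
subjects = np.array([0, 0, 1, 2])

dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(y))

# one list of indices per subject; the DataLoader yields one batch per list
subject_batches = [np.flatnonzero(subjects == s).tolist() for s in np.unique(subjects)]
loader = DataLoader(dataset, batch_sampler=subject_batches)

for data, target in loader:
    print(data.shape)  # torch.Size([2, 2]), then torch.Size([1, 2]), then torch.Size([1, 2])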

It’s not that much less efficient. Consider that with DataLoader you still have to use a loop, as it creates an iterable.
So it would still look like:

for i in range(num_epochs):
    for j, dat in enumerate(loader):  # loader = DataLoader(dataset, batch_size=...)
        data = dat[0]
        target = dat[1]
        # all the other training stuff

I also think the reason DataLoader isn’t necessary in this case is that you are simply doing batch size 1. That is why it doesn’t work.

The batch is a third dimension that you add to your data to group things together; in your case, multiple people.

In the particular example you provided, your subject 0 has data with seq_length = 2 and input_size = 2 (as you have x, y coordinates); I’m assuming the y there isn’t your target data.

For example, if you are feeding data to an RNN or LSTM with batch_first = True, the data should have the following format:
[batch_size, seq_length, input_size]

The shape of your input data for subject 0 would be [1, 2, 2], and the shape for subject 1 would be [1, 1, 2]. As you can see, batch size = 1.

I think you confused your sequence length (number of instances per person) with batch size. The same thing happened to me when I began with PyTorch.

Just remember: each of your people is a 2D matrix of shape (instances, input_size). The batch size is the number of these 2D matrices you give to your model before performing gradient descent.
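
A quick sketch of those shapes with the sample data you posted earlier (just illustrative tensors, not a training setup):

import torch

# subject 0: two instances of input_size 2 -> [batch_size, seq_length, input_size]
subject_0 = torch.tensor([[[2., 3.], [1., 1.]]])
# subject 1: a single instance
subject_1 = torch.tensor([[[4., 3.]]])

print(subject_0.shape)  # torch.Size([1, 2, 2])
print(subject_1.shape)  # torch.Size([1, 1, 2])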