Data-dependent network - problems in dataloader and train loop

hey everyone,
I am building a binding affinity prediction model.

Firstly, my dataset loads the training data and does some preprocessing in the __getitem__() function, where it returns a data-dependent variable tf_x:

class MyDataset(Dataset):
    def __init__(self):
        ...
    def __getitem__(self, idx):
        ...
        # tf_x has a variable size: it may consist of
        # 1, 2, 3, ... up to 20 sequences
        return DNA, tf_x, binding_label
    def __len__(self):
        ...

Secondly, my network passes every tf in tf_x through a CNN layer and averages the outputs:

class MyNetwork(nn.Module):
    def __init__(self):
        ...
    def forward(self, DNA, tf_x):
        for index, tf in enumerate(tf_x.unbind(dim=-1)):
            if index == 0:
                # custom convolution layer,
                # I just use an nn.Conv2d (self.conv) as a placeholder here
                conv_out_sum = self.conv(tf)
            else:
                conv_out_sum += self.conv(tf)
        # average over the number of tfs
        conv_out = conv_out_sum / tf_x.size(-1)
        ...
        return pred

Now, during training the model usually takes a mini batch from the dataloader,
but then the network may encounter a different number of tfs in tf_x for each sample. How can I fix this?

Another problem is that, since tf_x has a variable size, the dataloader by default uses torch.stack() to collate the data, and that does not work with tensors of different shapes. It can probably be solved with a custom collate_fn, but I don't really get how.

I have also tried batch_size = 1. It works, but a batch size of 1 causes errors in the norm layers, and it takes more time to train and gives a worse model.
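For example, a norm layer over a single-sample batch can fail like this in training mode (a minimal repro with made-up shapes, not my actual model):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(8)
bn.train()
# a batch of a single sample with 8 features raises:
# ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 8])
out = bn(torch.randn(1, 8))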

Is there a better way to achieve this?

any suggestion would be much appreciated!

You could try to batch samples according to their length or you could also pad them. @vdw shares some great resources here from the NLP domain, which deals with similar challenges.
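For the padding approach, a minimal sketch of a collate_fn could look like this (assuming each sample's tf_x is a tensor of shape [num_tf, feature_len] and binding_label is a scalar; adjust to your actual shapes):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate(batch):
    # each sample is (DNA, tf_x, binding_label)
    DNA, tf_x, labels = zip(*batch)
    # pad the variable num_tf dimension to the longest sample in the batch
    tf_x_padded = pad_sequence(list(tf_x), batch_first=True)   # [bs, max_num_tf, feature_len]
    # keep the true lengths so the model can ignore the padded entries later
    lengths = torch.tensor([t.size(0) for t in tf_x])
    return torch.stack(DNA), tf_x_padded, lengths, torch.tensor(labels)

# loader = DataLoader(dataset, batch_size=32, collate_fn=pad_collate)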


thanks for your reply!
I have been stuck on this problem for a long time.
As you mentioned, I also tried to pad the samples to the max length of every batch in a collate_fn, but then the per-tf iteration in MyNetwork becomes a problem, as does the averaging step that uses the length of tf_x.

I will check the links you posted, thanks a lot! I have also learned a lot from your previous replies to everyone's questions.

If any other ideas come to mind, please let me know!

thanks


For documentation and for others who find this post:

I solved my problem along the lines of what I mentioned in my previous reply.

In my case the RNN-style padding was not the best fit for the collate_fn, so I used a dynamic padding strategy and also return a mask tensor from the collate_fn:

def collate_fn(batch): # something like
    DNA_tf_x, targets = zip(*batch)
    DNA_x, tf_x = zip(*DNA_tf_x)
    # pad every tf tensor along dim 1 up to the largest size in the batch
    max_size_dim1 = max(tf.size(1) for tf in tf_x)
    tf_padded_x = [torch.cat((tf, torch.zeros(tf.size(0), max_size_dim1 - tf.size(1))), dim=1)
                   if tf.size(1) < max_size_dim1 else tf for tf in tf_x]
    # mask is True at the padded positions, False at the valid ones
    mask_tensor = [torch.tensor([False] * tf.size(1) + [True] * (max_size_dim1 - tf.size(1)))
                   for tf in tf_x]
    return ((torch.stack(DNA_x), torch.stack(tf_padded_x).long()),
            torch.stack([torch.tensor(i) for i in targets]),
            torch.stack(mask_tensor))
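and then the dataloader just needs the custom collate_fn (a sketch; the batch_size and the exact forward signature are only examples):

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

for (DNA_x, tf_x), targets, mask in loader:
    pred = model(DNA_x, tf_x, mask)   # the network below takes the mask as an extra input
    ...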

In my network, I stack the conv output of every tf and use the mask to zero out the unwanted (padded) positions, so that I can get the real average output:

class MyNetwork(nn.Module): # something like
    def __init__(self):
        ...
    def forward(self, DNA, tf_x, mask):
        conv_out_list = []
        for tf_1, mask_1 in zip(tf_x.unbind(dim=-2), mask.unbind(dim=-1)):
            # custom convolution layer, where tf_1 is transformed into a kernel;
            # note that some tf_1 come from the padding rather than a valid tf
            conv_out_list.append(custom_conv1d)
            # or, if MaskedBatchNorm1d is enabled, it would be like this instead:
            # conv_out_list.append(mask_bn(custom_conv1d, mask_1))

        # stack all conv outputs along the last dim
        conv_out = torch.stack(conv_out_list, dim=-1)

        mask_true = torch.sum(mask, dim=-1)

        # avoid a 0 divisor in mask_true
        mask_true = mask_true.masked_fill(mask_true == 0, 1)
        mask = mask.unsqueeze(1).unsqueeze(2)

        # zero out the padded positions: for every sample in the batch, if tf_1 is
        # padding, its slice of conv_out (e.g. [:, :, :, i]) becomes 0
        conv_out = conv_out.masked_fill(mask, 0)

        # take the average of conv_out over the tfs
        conv_out = torch.sum(conv_out, dim=-1)
        conv_out = torch.div(conv_out, mask_true.view(-1, 1, 1))

        return conv_out

The batch norm also needs the mask: in my case I need to mask out batch-level data after conv_out is generated, so I modified a masked batch norm layer that was posted about two years ago:

class MaskedBatchNorm1d(nn.BatchNorm1d):
    def __init__(self, num_features, eps=1e-5, momentum=0.1,
                 affine=True, track_running_stats=True):
        super(MaskedBatchNorm1d, self).__init__(
            num_features,
            eps,
            momentum,
            affine,
            track_running_stats
        )

    def forward(self, inp, mask):
        self._check_input_dim(inp)
        exponential_average_factor = 0.0
        n = mask.sum()
        if n.item() != 0:
            mask = mask / n
            mask = mask.unsqueeze(1).unsqueeze(1).expand(inp.shape)            
            process_inp = inp * mask
        else:
            process_inp = inp

        if self.training and self.track_running_stats:
            if self.num_batches_tracked is not None:
                self.num_batches_tracked += 1
                if self.momentum is None:  
                    exponential_average_factor = 1.0 / float(self.num_batches_tracked)
                else:  
                    exponential_average_factor = self.momentum

        if not self.track_running_stats:  # Should raise an exception if n==1
            mean = (process_inp).sum([0, 2])
            var = ((process_inp ** 2).sum([0, 2]) - mean ** 2) * n / (n - 1)
        elif self.training and n > 1:
            mean = (process_inp).sum([0, 2])
            var = (process_inp ** 2).sum([0, 2]) - mean ** 2
            with torch.no_grad():
                self.running_mean = exponential_average_factor * mean\
                    + (1 - exponential_average_factor) * self.running_mean
                self.running_var = exponential_average_factor * var * n / (n - 1)\
                    + (1 - exponential_average_factor) * self.running_var
        else:
            mean = self.running_mean
            var = self.running_var

        inp = (inp - mean[None, :, None]) / (torch.sqrt(var[None, :, None] + self.eps))
        if self.affine:
            inp = inp * self.weight[None, :, None] + self.bias[None, :, None]

        return inp
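Calling it then looks something like this (a sketch: the way the code above consumes the mask, it should be a per-sample mask with 1 for the samples that should contribute to the batch statistics):

import torch

mbn = MaskedBatchNorm1d(num_features=16)
mbn.train()

inp = torch.randn(4, 16, 10)                  # [bs, C, L]
sample_mask = torch.tensor([1., 1., 0., 1.])  # 1 = include this sample in the stats
out = mbn(inp, sample_mask)                   # same shape as inp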

I have not checked the code thoroughly, so if anyone takes a look and finds anything wrong, please let me know!
thanks

@ptrblck
:melting_face: if you can have a look ~

thanks!

I’m not entirely sure how your padding and masking works, but it seems your samples differ in their length in dim1? Does it mean the samples have a different number of channels and you want to compute the right stats (e.g. in norm layers) using only the valid channels while ignoring the padding?

I have modified the reply above:

https://discuss.pytorch.org/t/data-dependent-network-problems-in-dataloader-and-train-loop/201219/4?u=cmf1997

I have also tried to train the model on a mini dataset.
As expected, the model overfits when I do not use MaskedBatchNorm1d,
but with MaskedBatchNorm1d the training loss just fluctuates around 0.69 (about ln 2, i.e. chance level if the loss is binary cross-entropy). I believe something must be going wrong in MaskedBatchNorm1d, but I have not worked it out yet.

@ptrblck
if you can have a look ~

any suggestion would be much appreciated!

I have worked it out, thanks to ptrblck!

For clarification:
in the previous reply
there is a mistake in the MyNetwork forward function,

mask_true = torch.sum(mask, dim=-1)
mask_true = mask_true.masked_fill(mask_true==0, 1)

the first of these lines should be modified to

mask_true = torch.sum(~mask, dim=-1)
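
Since the mask from the collate_fn is True at the padded positions, summing mask counts the padding, while summing ~mask counts the valid tfs, which is the divisor the average actually needs. A quick toy check:

import torch

# True = padded tf position, as produced by the collate_fn
mask = torch.tensor([[False, False, False],
                     [False, True,  True ]])

print(torch.sum(mask, dim=-1))   # tensor([0, 2])  -> number of padded tfs (wrong divisor)
print(torch.sum(~mask, dim=-1))  # tensor([3, 1])  -> number of valid tfs (correct divisor)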

Great catch and thanks for sharing the fix!