Make a TensorDataset and DataLoader with multiple input parameters

DeepLearner17 · October 5, 2018, 4:51pm

Hello,

I have a dataset composed of labels, features, adjacency matrices, and Laplacian graphs in NumPy format.

I would like to build a torch.utils.data.TensorDataset() and a torch.utils.data.DataLoader() that can take labels, features, adjacency matrices, and Laplacian graphs.

To do so, I have tried the following:


import numpy as np
import torch
import torch.utils.data as data_utils

# get the numpy data
labels_train,features_train,adjacency_train,laplacian_train=train
labels_test,features_test,adjacency_test,laplacian_test=test


# expand dimension
features_train=np.expand_dims(features_train,axis=0)
features_test=np.expand_dims(features_test,axis=0)

adjacency_train=np.expand_dims(adjacency_train,axis=0)
adjacency_test=np.expand_dims(adjacency_test,axis=0)

laplacian_train=np.expand_dims(laplacian_train,axis=0)
laplacian_test=np.expand_dims(laplacian_test,axis=0)

# convert numpy data to torch
labels_train=torch.from_numpy(labels_train)
features_train=torch.from_numpy(features_train)
adjacency_train=torch.from_numpy(adjacency_train)
laplacian_train=torch.from_numpy(laplacian_train)

labels_test=torch.from_numpy(labels_test)
features_test=torch.from_numpy(features_test)
adjacency_test=torch.from_numpy(adjacency_test)
laplacian_test=torch.from_numpy(laplacian_test)

# I get stuck here

train=data_utils.TensorDataset(features_train.float(),labels_train,adjacency_train.float(),laplacian_train.float())

test=data_utils.TensorDataset(features_test.float(),labels_test,adjacency_test.float(),laplacian_test.float())

It doesn’t work because TensorDataset seems to take only two parameters, features_train and labels_train.
Is there any way to make it accept more than two parameters?

What is my purpose?
Once I have train and test from data_utils.TensorDataset(), I would like to load my data as follows:

train_loader=data_utils.DataLoader(train)
val_loader= data_utils.DataLoader(test)

It doesn’t work because DataLoader() is supposed to have only target and features_data (not adjacency matrices and Laplacians).

I need this setup in order to do the following :

for i,(input,target,adjacency_matrix,laplacian) in enumerate(train_loader):
     # do my training

Thank you for your help

@smth, @ptrblck is there any clue?

ptrblck · October 5, 2018, 5:08pm

What kind of error do you get or why is it not working?
The TensorDataset takes an arbitrary number of input tensors.
Here is a small example using just random data:

import torch
from torch.utils.data import TensorDataset, DataLoader

nb_samples = 100
features = torch.randn(nb_samples, 10)
labels = torch.empty(nb_samples, dtype=torch.long).random_(10)
adjacency = torch.randn(nb_samples, 5)
laplacian = torch.randn(nb_samples, 7)

dataset = TensorDataset(features, labels, adjacency, laplacian)
loader = DataLoader(
    dataset,
    batch_size=2
)

for batch_idx, (x, y, a, l) in enumerate(loader):
    print(x.shape, y.shape, a.shape, l.shape)
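For the NumPy arrays in the original question, the same pattern should apply once each array is converted with torch.from_numpy. Below is a minimal sketch, assuming every array already has the sample dimension first; the shapes and the random stand-in arrays are made up for illustration. Under that assumption np.expand_dims(..., axis=0) is not needed, and it would make the first dimension 1, which no longer matches the number of labels.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-ins for the real arrays: 8 samples, 4 graph nodes each
num_samples, num_nodes = 8, 4
labels_np = np.random.randint(0, 3, size=num_samples)
features_np = np.random.rand(num_samples, num_nodes, 5)
adjacency_np = np.random.rand(num_samples, num_nodes, num_nodes)
laplacian_np = np.random.rand(num_samples, num_nodes, num_nodes)

# All tensors share the same first dimension (the number of samples)
train_set = TensorDataset(
    torch.from_numpy(features_np).float(),
    torch.from_numpy(labels_np).long(),
    torch.from_numpy(adjacency_np).float(),
    torch.from_numpy(laplacian_np).float(),
)
train_loader = DataLoader(train_set, batch_size=4, shuffle=True)

for i, (inputs, target, adjacency_matrix, laplacian) in enumerate(train_loader):
    print(i, inputs.shape, target.shape, adjacency_matrix.shape, laplacian.shape)
```

Each batch unpacks into four tensors in the order they were passed to TensorDataset.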

DeepLearner17 · October 6, 2018, 10:10pm

Thanks a lot. I was confused. When I looked at the TensorDataset function, I found that it takes only data_tensor and target_tensor.

rchavezj · October 16, 2018, 11:04pm

What if we wish to pass a 3D input feature into TensorDataset? I was told only 1D or 2D were allowed.

ptrblck · October 16, 2018, 11:27pm

I’m not sure what the limit is.
If you change features in my example code to a 10-dimensional tensor (features = torch.randn(nb_samples, 2, 2, 2, 2, 2, 2, 2, 2, 10)), it still works on the CPU and GPU.
I think I once heard of some limitation regarding the dimensionality of GPU tensors, but I currently can’t produce an example which fails.
Who told you 3-dimensional tensors are not allowed?

rchavezj · October 16, 2018, 11:56pm

Sorry, I wasn’t told anything; I was referring to a post on Stack Overflow (https://stackoverflow.com/questions/41924453/pytorch-how-to-use-dataloaders-for-custom-datasets)

ptrblck · October 17, 2018, 12:03am

Thanks for the link. I had a look through the history of TensorDataset, as the post came from Feb. 2017, and can’t find any constraint limiting it to 2-dimensional tensors.
Such a constraint would also mean that no image tensors could be used in TensorDataset, which would be a strange design.
Another person commented a week later that tensors with an arbitrary number of dimensions can be used, so I guess it’s just a mistake by the user.
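As a quick check of the image case, here is a small sketch with a 4-dimensional feature tensor. The shapes (16 samples of 3x8x8) are arbitrary and chosen only for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 16 fake "images" with 3 channels and 8x8 pixels, plus one class index each
images = torch.randn(16, 3, 8, 8)
targets = torch.randint(0, 10, (16,))

dataset = TensorDataset(images, targets)
loader = DataLoader(dataset, batch_size=4)

x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([4, 3, 8, 8]) torch.Size([4])
```

Only the first dimension is treated as the sample index; the remaining dimensions are passed through unchanged.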

rchavezj · October 17, 2018, 1:14am

Continue being awesome!

Mohamed_Ragab · October 19, 2019, 4:01am

Thanks for your clear explanation. But what if, in the given example, I have a different number of samples (nb_samples) for features and adjacency? How can I deal with this?

ptrblck · October 19, 2019, 6:03am

I assume features and adjacency correspond to your input and target data?
If so, how are the data samples corresponding to the target, if you have different number of samples?

Mohamed_Ragab · October 19, 2019, 6:45am

Thanks for your reply. In my case, I have input_tensor, target_tensor and label. The target tensors are reconstructed versions of the input tensors, and they are only available for input tensors without labels. For example, I have 500 input samples: 200 of them have labels, and the other 300 don’t have labels but have their corresponding target tensors.
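One possible way to handle such partially labelled data is to build one TensorDataset per subset, so that each dataset only contains tensors of matching length. The sketch below assumes the 200 labelled samples come first; that ordering, the shapes, and all names are hypothetical.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs = torch.randn(500, 30)
labels = torch.randint(0, 2, (200,))   # labels for the first 200 inputs (assumed ordering)
targets = torch.randn(300, 30)         # targets for the remaining 300 inputs

# Each dataset pairs a slice of the inputs with the tensor of matching length
labelled_set = TensorDataset(inputs[:200], labels)
unlabelled_set = TensorDataset(inputs[200:], targets)

labelled_loader = DataLoader(labelled_set, batch_size=10, shuffle=True)
unlabelled_loader = DataLoader(unlabelled_set, batch_size=10, shuffle=True)
print(len(labelled_set), len(unlabelled_set))  # 200 300
```

The two loaders can then be iterated separately, or alternated inside the training loop.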

Mohamed_Ragab · October 19, 2019, 6:53am

I have worked around this problem by splitting the data into two parts. But now I have another issue:
train_dataset = TensorDataset(input_tensor, target_tensor, label)
train_dl = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=drop_last)
My issue is that I need to keep each input tensor paired with its target tensor. But when I activate shuffling, the input and target seem to be shuffled in different ways. Is there a method to keep the input and target tensors in pairs even with shuffling activated?

ptrblck · October 19, 2019, 10:50am

If I understand the code correctly, you are now passing tensors with different shapes to TensorDataset.
I would rather create a single target tensor and make sure the target and labels correspond to the input samples.

Shuffling will shuffle each tensor in the same way, so I assume you might see some issues if the shapes do not match. I’m not sure whether the smallest size will be used in this approach.
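On the question of mismatched sizes: as far as I know, TensorDataset checks at construction time that all tensors have the same size in dimension 0 and raises an AssertionError otherwise, rather than truncating to the smallest one. A minimal sketch with made-up shapes (500 inputs, 200 labels):

```python
import torch
from torch.utils.data import TensorDataset

inputs = torch.randn(500, 30)
labels = torch.randint(0, 2, (200,))  # fewer labels than inputs

try:
    TensorDataset(inputs, labels)
    mismatch_rejected = False
except AssertionError as err:
    # Raised because the first dimensions (500 vs. 200) differ
    mismatch_rejected = True
    print("TensorDataset refused the tensors:", err)
```

So tensors of different lengths have to be aligned or split before they are wrapped in a single TensorDataset.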

Mohamed_Ragab · October 20, 2019, 6:49am

This is what I also assumed, but when I get a single batch without shuffling, the pairs work fine. Only when shuffle is True are the input and target tensors no longer pairs. Note: to check the shuffling issue, I made the input and target with the same shape and the same values, and I found that they are shuffled independently. So, please, can you help with this?

ptrblck · October 20, 2019, 9:14am

Could you post a code snippet to reproduce this issue? The data and target samples should be shuffled in the same way; otherwise their correspondence would be broken.

This dummy example works fine:

import torch

x = torch.arange(100)
y = torch.arange(100)

dataset = torch.utils.data.TensorDataset(x, y)
loader = torch.utils.data.DataLoader(
    dataset,
    shuffle=True,
    batch_size=2
)

for data, target in loader:
    print((data==target).all())

Mohamed_Ragab · October 20, 2019, 9:48am

I am really thankful for your response. Let me explain further what I am doing so that you have the full picture. I am using a sequence-to-sequence model on time series data to achieve two tasks concurrently: forecast the next time step and predict the corresponding label of the input sequence. So, for each input sequence, my model has two targets: a shifted version of the input and the true label of the input. In the end, I need input_tensor(n) to equal target_tensor(n-1), as I am shifting by one step.
Here is the code:

import torch
from torch.utils.data import Dataset, DataLoader

class Cmapps_train_Dataset(Dataset):

    def __init__(self, data_identifier, data_dir, win_size):
        """Reads source and target sequences from the processing file."""
        # process_data and roll are defined elsewhere (not shown)
        x_train, x_test, y_train, y_test = process_data(data_dir, data_identifier, win_size)
        self.input_tensor = torch.from_numpy(x_train).float()
        # target is the input shifted by one step
        self.target_tensor = roll(self.input_tensor, -1, 0, 0)
        self.label = torch.from_numpy(y_train).float()
        self.num_total_seqs = len(self.input_tensor)

    def __getitem__(self, index):
        """Returns one sample (source, target, label)."""
        src_seq = self.input_tensor[index]
        trg_seq = self.target_tensor[index]
        labels = self.label[index]
        return src_seq, trg_seq, labels

    def __len__(self):
        return self.num_total_seqs

data_dir = r"…/…/Deep Learning for RUL/data/"
data_identifier = 'FD001'
win_size = 30; batch_size = 10
shuffle_stats = True  # set to False for the unshuffled run
drop_last = True

train_data = Cmapps_train_Dataset(data_identifier, data_dir, win_size)
train_dl = DataLoader(train_data, batch_size=batch_size, shuffle=shuffle_stats, drop_last=drop_last)

=====

sample = next(iter(train_dl))
input_tensor = sample[0]
target_tensor = sample[1]
# compare each input with the target of the previous position in the batch
for n in range(1, 10):
    print((input_tensor[n] == target_tensor[n-1]).all())

===========

Results with shuffle=False:
tensor(True)
tensor(True)
tensor(True)
tensor(True)
tensor(True)
tensor(True)
tensor(True)
tensor(True)
tensor(True)

=====

Results with shuffle=True :

tensor(False)
tensor(False)
tensor(False)
tensor(False)
tensor(False)
tensor(False)
tensor(False)
tensor(False)
tensor(False)

Mohamed_Ragab · October 20, 2019, 11:02am

I am completely embarrassed, but I found that I was trying to match input and target at different indices. When I compared input and target at the same index, I found that their correspondence is kept. I apologize again for wasting your valuable time.
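For anyone hitting the same confusion, here is a small sketch of the difference, using torch.roll as a stand-in for my own roll helper and a toy tensor instead of the real data. With shuffling, the pair at the same index stays intact, while the relation between neighbouring positions in a batch is generally lost, because neighbours in a shuffled batch are no longer consecutive samples.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

x = torch.arange(100)
# shifted[i] == x[i + 1], wrapping around at the end
shifted = torch.roll(x, shifts=-1, dims=0)

loader = DataLoader(TensorDataset(x, shifted), batch_size=10, shuffle=True)
data, target = next(iter(loader))

# Same-index check: each target is still the successor of its own input
same_index_ok = bool((target == (data + 1) % 100).all())
# Neighbour check (what I did by mistake): only holds for unshuffled, consecutive samples
neighbour_ok = bool((data[1:] == target[:-1]).all())
print(same_index_ok, neighbour_ok)
```

The same-index check is the one that tells you whether the DataLoader kept the pairs together.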

ptrblck · October 20, 2019, 12:00pm

Haha, good to hear it’s working now!
Don’t feel embarrassed. Yesterday I forgot to zero out the gradients and was debugging my model for 15 minutes

Mohamed_Ragab · October 20, 2019, 1:47pm

Haha, so even experts can go wrong sometimes.

hamid_20 · August 21, 2021, 2:21pm

Hello,
What are labels?