How can I create a single DataLoader for my two CSV files? I have tried this:

import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

data_file1 = "/Data_train_A.csv"
rnaseq = pd.read_csv(data_file1, index_col=0, header=0)
rnaseq_tensor1 = torch.FloatTensor(rnaseq.values)
#print(rnaseq.shape)

data_file2 = "/Data_train_B.csv"
rnaseq = pd.read_csv(data_file2, index_col=0, header=0)
rnaseq_tensor2 = torch.FloatTensor(rnaseq.values)
#print(rnaseq.shape)

dataset = TensorDataset(rnaseq_tensor1, rnaseq_tensor2)
dataloader = DataLoader(dataset, batch_size=2)

for batch_idx, (a, b) in enumerate(dataloader):
    print(a.shape, b.shape)

What does the print statement output, and what is your expectation? 🙂

Do you want this as one big list, or do you want them in parallel?

I want them as inputs to my CycleGAN, where we have data from domain A and domain B. It is used for mapping from domain A to B, so both inputs are fed to the CycleGAN at the same time.

And do you get any error using this approach?
We would need the error message or some more information on what's not working to help with debugging. 😉

Yeah, it shows me this:


AssertionError                            Traceback (most recent call last)
<ipython-input-...> in <module>()
     10 #print(rnaseq.shape)
     11
---> 12 dataset = TensorDataset(rnaseq_tensor1, rnaseq_tensor2)
     13 dataloader = DataLoader(dataset, batch_size=2)
     14

/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataset.py in __init__(self, *tensors)
    156
    157     def __init__(self, *tensors):
--> 158         assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
    159         self.tensors = tensors
    160

AssertionError:

This error points to different lengths of the two input tensors.
TensorDataset indexes all passed tensors with the same index internally, and dim0 is used for this indexing.
If one tensor contains more samples than the other, this assertion is raised.

You could e.g. crop the larger tensor to the length of the smaller one, or duplicate samples of the smaller one.
If you want to apply a more complicated sampling strategy, I would recommend writing a custom Dataset and creating the pairs in __getitem__, as in the sketch below.
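For example, here is a minimal sketch of such a Dataset, which reuses samples of the smaller tensor via modulo indexing so that no sample of the larger tensor is dropped (the class name and pairing rule are just placeholders for your own strategy):

import torch
from torch.utils.data import Dataset, DataLoader

class PairedDomainDataset(Dataset):
    # Pairs two tensors of different lengths by wrapping the index
    # of the smaller one around with the modulo operator.
    def __init__(self, tensor_a, tensor_b):
        self.tensor_a = tensor_a
        self.tensor_b = tensor_b

    def __len__(self):
        # iterate over the larger domain
        return max(self.tensor_a.size(0), self.tensor_b.size(0))

    def __getitem__(self, index):
        a = self.tensor_a[index % self.tensor_a.size(0)]
        b = self.tensor_b[index % self.tensor_b.size(0)]
        return a, b

dataset = PairedDomainDataset(torch.arange(10).view(-1, 1),
                              torch.arange(20).view(-1, 1))
loader = DataLoader(dataset, batch_size=5)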

Does that mean I have to create separate DataLoaders then?

No. You should pass the same number of samples or create some correspondence between both tensors in your custom Dataset (or a custom collate_fn etc.).

Could you please provide a complete example?

Here is a small example that reproduces this error and shows how to slice the larger tensor:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Works, since a and b have the same length
a = torch.arange(10).view(-1, 1)
b = torch.arange(10).view(-1, 1)

dataset = TensorDataset(a, b)
loader = DataLoader(
    dataset,
    batch_size=5
)

for idx, (data1, data2) in enumerate(loader):
    print('Idx ', idx)
    print(data1)
    print(data2)

# Use different lengths
a = torch.arange(10).view(-1, 1)
b = torch.arange(20).view(-1, 1)

# TensorDataset(a, b) would fail here with the same AssertionError
# Slice b to the same length as a instead
dataset = TensorDataset(a, b[:a.size(0)])
loader = DataLoader(
    dataset,
    batch_size=5
)

for idx, (data1, data2) in enumerate(loader):
    print('Idx ', idx)
    print(data1)
    print(data2)

Note that this approach might not be the best fit for your use case, so you should adapt it to yield corresponding pairs from both input tensors.
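If your two domains are unpaired (which is usually the case for CycleGAN training data), one common alternative is to draw a random partner from domain B for each sample of domain A. Here is a rough sketch of that idea; the random pairing is an assumption about your use case, and UnpairedDataset is just an illustrative name:

import torch
from torch.utils.data import Dataset

class UnpairedDataset(Dataset):
    # Returns each sample from domain A together with a randomly
    # drawn sample from domain B, so the pairing changes every epoch.
    def __init__(self, tensor_a, tensor_b):
        self.tensor_a = tensor_a
        self.tensor_b = tensor_b

    def __len__(self):
        return self.tensor_a.size(0)

    def __getitem__(self, index):
        a = self.tensor_a[index]
        rand_idx = torch.randint(self.tensor_b.size(0), (1,)).item()
        return a, self.tensor_b[rand_idx]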

Thanks, it really helped. Thanks again!