DataLoader - Problems

Afternoon,

I’m hoping someone can help me.

I am using DataLoader to load 16 1x2048 vectors into PyTorch. The problem is that when I view what DataLoader is loading, I find that the values have all been shuffled up. I have a tensor which is 16x1x2048 and all the values are present, just not in the correct order (to quote the late, great Eric Morecambe :-)…)

Does anybody know how I can extract 16 1x2048 vectors which maintain the order of the original file, please?

Thanks for your help

Chaslie

There’s an argument to DataLoader() called shuffle; just set shuffle=False. Hope this helps.
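
For example, a minimal sketch (the dataset and names here are made up just to illustrate the flag):

import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.arange(8).float().unsqueeze(1)                    # 8 samples, in order 0..7
loader = DataLoader(TensorDataset(x), batch_size=4, shuffle=False)

for (batch,) in loader:
    print(batch.squeeze(1))                                 # 0..3, then 4..7 -- original order kept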

Hi Prerna,

I tried this and it doesn’t affect the make-up of the vector. I would like the order of the vector to be maintained, i.e.:

(1 2 3 4 5 6 7 8)

but pytorch dataloader is loading the vector as:

(1 4 2 6 7 3 8 5)

That’s strange; it shouldn’t shuffle along the feature dimension like that. Can you post some code?
Edit: Follow-up question: along which dimension is the DataLoader shuffling the values? Batch or feature?
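
For reference, shuffle=True should only permute which samples land in which batch; it never reorders the values inside a sample. A quick sketch to check that (the data here is made up for illustration):

import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.arange(4 * 8).reshape(4, 8).float()   # 4 samples, 8 features each
loader = DataLoader(TensorDataset(x), batch_size=2, shuffle=True)

for (batch,) in loader:
    print(batch)   # rows come out in random order, but each row itself stays intact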

Hi Prerna,

I am loading the data from a 42000x2048 array and then creating a 16x1x2048 tensor called data. When I view this, it has lost the order of the 42000x2048 array:

train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=BATCHSIZE, shuffle=False)
for epoch in range(num_epochs):
    for data, target in train_loader:
        # print("data=", data.shape)
        np.savetxt(f, data.numpy())    # dump the batch to file for inspection
        np.savetxt(f2, target.numpy())
        data = data.unsqueeze(1)       # [16, 2048] -> [16, 1, 2048]
        print("data_sq=", data.shape)
        data = data.cuda()

I think it’s shuffling along the feature dimension, though looking at the data I would guess that it’s generating a 16x1x2048 tensor of numbers randomly taken from the X_train array.

I think the issue might be that the shuffling happens when you reshape it, because the DataLoader itself never shuffles along the feature dimension.

Is there a way I can view the contents of train_loader and train_dataset?

So was TensorDataset() a custom dataset that you defined?

Edit: Looks like it’s a predefined subclass of the Dataset() class; I think you can find the source code for it online.

Yes, X_train is the 42000x2048 input data and y_train is the labels, a 1x42000 array numbering from 1 to 42000.

Here - https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#TensorDataset
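
The core of it is tiny: __getitem__ just indexes each stored tensor along the first (sample) dimension, roughly like this (paraphrasing the linked source):

class TensorDataset(Dataset):
    def __init__(self, *tensors):
        # all tensors must have the same number of samples (dim 0)
        assert all(tensors[0].size(0) == t.size(0) for t in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        # pick sample `index` from each tensor; nothing inside a sample is touched
        return tuple(t[index] for t in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)

So by itself it can’t reorder values within a sample.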

OK, so being a bit slow (Friday night over here!): is the problem that the y_train array holds numbers 1 to 42000, and if so, what’s the best way to solve it? Should I leave y_train empty?

Okay, can you try this in the for loop:

ctr = 0
for idx, (batch, target) in enumerate(train_loader):
    if ctr == 0:
        print(batch)   # print only the first batch
    ctr += 1

Now check if it shuffles it.
I did this and it didn’t shuffle along the feature dimension for me.

Actually even when I do

for (batch, target) in train_loader:

It doesn’t shuffle it.

I’m not sure what the issue might be in your case.

It’s still shuffling. I wish I could view the contents of:

train_dataset = TensorDataset(X_train, y_train)

and

train_loader = DataLoader(train_dataset, batch_size=BATCHSIZE, shuffle=True)

You can index the Dataset directly and check for equal values:

import torch
from torch.utils.data import TensorDataset

x = torch.randn(42, 2048)
y = torch.randn(42, 1)

dataset = TensorDataset(x, y)

for idx in range(x.size(0)):
    data, target = dataset[idx]
    assert (data == x[idx]).all(), "data shuffled"
    assert (target == y[idx]).all(), "target shuffled"

That would raise an exception, since the number of samples is different for X_train and y_train (42000 vs. 1). Could you check it again and post the code showing how you are creating the [16, 1, 2048]-shaped tensor?
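
E.g. this fails straight away, because TensorDataset asserts that all tensors share the same size in dim 0:

import torch
from torch.utils.data import TensorDataset

X_train = torch.randn(42000, 2048)
y_train = torch.randn(1, 42000)            # size(0) is 1, not 42000

dataset = TensorDataset(X_train, y_train)  # raises an AssertionError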

Hi Ptrblck,

Apologies for the delay in getting back to you. I ran your code on a sample deck and compared the values in data to the input array x, and I can see that the tensor data doesn’t get shuffled, by comparing the output of the print data statement to the print x statement…

from __future__ import print_function
import torch
import torch.nn.parallel
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset

x = torch.randn(42, 2048)
y = torch.randn(42, 1)
print("x=", x)
# print("y=", y)

dataset = TensorDataset(x, y)
train_loader = DataLoader(dataset, batch_size=1, shuffle=True)

for idx in range(x.size(0)):
    data, target = dataset[idx]
    assert (data == x[idx]).all(), "data shuffled"
    assert (target == y[idx]).all(), "target shuffled"

for epoch in range(1):
    for (data, target) in train_loader:
        # print("data=", data.shape)
        print("data", data)
        # np.savetxt(f2, target.numpy())

However, when I repeat the same exercise with my dataset, the contents of data are not in the same order as the input. The code is:

X_train = train_array.transpose([1, 0, 2]).reshape(42000, 2048)  # the input array size is [42000, 2048, 1]
X_train = torch.Tensor(X_train)
y_train = np.arange(1, 42001, 1)
y_train = y_train.reshape(42000, 1)
y_train = torch.from_numpy(y_train)

train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=BATCHSIZE, shuffle=True)

for idx in range(X_train.size(0)):
    data, target = train_dataset[idx]
    assert (data == X_train[idx]).all(), "data shuffled"
    assert (target == y_train[idx]).all(), "target shuffled"

for epoch in range(num_epochs):
    for (data, target) in train_loader:
        # print("data=", data.shape)
        np.savetxt(f, data.numpy())
        np.savetxt(f2, target.numpy())
        data = data.unsqueeze(1)
        print("data_sq=", data.shape)

I’m beginning to think that I have done something silly, but I really can’t see what I have done :confused:

Chaslie

All fixed, thanks for all your help.

What was the reason for the shuffled data? :slight_smile:

To be honest, I do not know. I just kept changing things until it worked :slight_smile:
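
For anyone landing on this thread later: judging by the preprocessing code posted above, a likely culprit is the transpose([1, 0, 2]) followed by the reshape. If train_array really is [42000, 2048, 1], that combination mixes feature values across samples, which looks exactly like shuffled data; simply dropping the trailing dimension preserves the order. A small sketch of the effect (illustrative only — this is a guess, not a confirmed fix):

import numpy as np

a = np.arange(2 * 3).reshape(2, 3, 1)        # 2 samples, 3 features each: [[0,1,2], [3,4,5]]

bad = a.transpose([1, 0, 2]).reshape(2, 3)   # [[0, 3, 1], [4, 2, 5]] -- values mixed across samples
good = a.reshape(2, 3)                       # [[0, 1, 2], [3, 4, 5]] -- order preserved (same as a.squeeze(-1))

print(bad)
print(good)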