DataLoader: how to use it with your own dataset?

As I understand it, DataLoader can be used to set batch_size and shuffle. I would like to use these DataLoader features with my own dataset.

X = numpy array of shape (8000, 432) (8000 samples, 432 features)
Y = numpy array of shape (8000,)

I wrote this code to train on my own dataset:

dataset = np.column_stack((self.X_train, self.Y_train))
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)

self.model = self.Net(self.feature_size, self.n_layers, self.n_neurons)
        
self.model.train()
optimizer = optim.Adam(self.model.parameters(), lr = self.learning_rate)
loss_func = nn.MSELoss()

# Training step
for (X,Y) in train_loader:
    x, y = Variable(X), Variable(Y)
    optimizer.zero_grad()
            
    self.y_train = self.model(x)
    loss = loss_func(self.y_train, y)
            
    loss.backward()  #retain_graph=True
    optimizer.step()

Am I on the right track?

Using this approach, your DataLoader will yield numpy arrays, which you would have to transform into torch.Tensors first.
This might be a better approach:

import numpy as np
import torch

X_train = torch.from_numpy(np.random.randn(64, 2)).float()  # cast to float32
y_train = torch.from_numpy(np.random.randint(0, 10, (64,)))
dataset = torch.utils.data.TensorDataset(X_train, y_train)

Also, Variables are deprecated since PyTorch 0.4.0.
If you are using a newer version, you can just remove the Variable wrappers. The volatile flag was replaced by a with statement:

with torch.no_grad():
    # your validation procedure
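For example, a minimal validation pass could look like this (just a sketch; val_loader is a hypothetical DataLoader built the same way as your train_loader, reusing your self.model and loss_func):

self.model.eval()  # disable dropout / use batchnorm running stats
with torch.no_grad():  # no gradients are tracked inside this block
    val_loss = 0.0
    for X, Y in val_loader:
        output = self.model(X)
        val_loss += loss_func(output, Y).item()
    val_loss /= len(val_loader)  # average loss over batches
self.model.train()  # switch back to training mode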

Hi ptrblck,

Thank you for the response.

Does TensorDataset give me the shuffle and batch_size options?

What gives you the shuffle and batch_size options is the torch.utils.data.DataLoader. Feed it the TensorDataset and you're golden 🙂
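
For example, continuing from the TensorDataset snippet above (batch_size and num_workers taken from your original code):

from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=dataset,  # the TensorDataset from above
                          batch_size=32,
                          shuffle=True,
                          num_workers=2)

for X, Y in train_loader:  # X and Y are already torch.Tensors here
    ...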
