What is the most standard way to put data in batch?

tcsn_wty · June 9, 2020, 1:29pm

I’m working on NLP projects, mostly with RNNs, LSTMs, and BERT. I’ve never learned PyTorch systematically, and I’ve seen many ways of putting data into torch tensors before passing it to a neural network. However, it seems that the choice can sometimes influence the training process. I would like to know whether there is a standard way to do this.

Say I’m doing text classification. Here are two ways I’ve tried; both worked for some projects.

First way:

import torch
from sklearn.model_selection import train_test_split

# Use 90% for training and 10% for validation.
# x is my input (numpy.ndarray), y is my label (numpy.ndarray).
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.1)

# Convert all inputs and labels into torch tensors, the required datatype for our model.
x_train = torch.tensor(x_train)
x_test = torch.tensor(x_test)
y_train = torch.tensor(y_train)
y_test = torch.tensor(y_test)

from torch.utils.data import DataLoader, TensorDataset

batch_size = 32

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_labels)
train_dataloader = DataLoader(train_data, batch_size=batch_size)

# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_labels)
validation_dataloader = DataLoader(validation_data, batch_size=batch_size)

Then during training we’ll do

for step, batch in enumerate(train_dataloader):
    inputs = batch[0].cuda()
    labels = batch[1].cuda()

The other way I’ve used is:

from torch.utils.data import DataLoader, TensorDataset

# create tensor dataset
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

batch_size = 32

# shuffle data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, drop_last=True)

for inputs, labels in train_loader:
    ...  # training step goes here

I wonder whether there is any documentation describing a standard way to do this in PyTorch. More specifically, is there a standard way to turn my input (say, a list of embedded sentences) into batches that a PyTorch neural network accepts?

ptrblck · June 10, 2020, 7:38am

I don’t know exactly what difference you would like to highlight.
In the first approach train_inputs and train_labels seem to be undefined (as well as validation_inputs and validation_labels), so I assume you would like to use x_train etc.?
Also, the DataLoader loop is different, since in the first approach you index into the batch inside the loop instead of unpacking it in the loop header (but that doesn’t matter and is just a different coding style).
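
To illustrate that the two loop styles yield the same tensors, here is a minimal sketch. The random data, its shapes, and the CPU-only setup are assumptions made for the example, and the x_train/y_train names follow the guess above rather than the original post.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 100 samples, 8 features, binary labels.
x_train = torch.from_numpy(np.random.rand(100, 8).astype(np.float32))
y_train = torch.from_numpy(np.random.randint(0, 2, size=100))

train_data = TensorDataset(x_train, y_train)
# shuffle=False keeps the batch order identical across both loops below.
train_dataloader = DataLoader(train_data, batch_size=32, shuffle=False)

# Style 1: index into the batch inside the loop.
indexed = [(batch[0], batch[1]) for batch in train_dataloader]

# Style 2: unpack the batch in the loop header.
unpacked = [(inputs, labels) for inputs, labels in train_dataloader]

# Compare the two styles batch by batch.
same = all(
    torch.equal(a[0], b[0]) and torch.equal(a[1], b[1])
    for a, b in zip(indexed, unpacked)
)
print(same, len(indexed))
```

Both loops iterate over the same (inputs, labels) pairs, so the choice between them should not affect training.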

Are these the differences or what would you like to discuss?
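
One difference that is visible in the two snippets is that the second passes shuffle=True and drop_last=True, while the first relies on the DataLoader defaults. The sketch below shows what drop_last changes on a hypothetical 100-sample toy dataset; the data and variable names are assumptions made for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 100 samples, so the last batch of 32 is partial.
data = TensorDataset(torch.arange(100).unsqueeze(1).float(), torch.arange(100))

# Defaults: shuffle=False, drop_last=False.
keep_loader = DataLoader(data, batch_size=32)
# drop_last=True discards the final incomplete batch.
drop_loader = DataLoader(data, batch_size=32, drop_last=True)

n_keep = len(keep_loader)
n_drop = len(drop_loader)

# Count how many samples each loader actually yields in one pass.
seen_keep = sum(labels.numel() for _, labels in keep_loader)
seen_drop = sum(labels.numel() for _, labels in drop_loader)

print(n_keep, seen_keep)
print(n_drop, seen_drop)
```

With drop_last=True, any samples in the final incomplete batch are skipped in that pass. On a test loader this means those samples would not be evaluated, which may or may not be what is intended.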