I’m working on NLP projects, mostly using RNN, LSTM and BERT. I’ve never systematically learned PyTorch, and I have seen many ways of putting data into torch tensors before passing them to a neural network. However, it seems that the choice between these ways can sometimes influence the training process. I would like to know whether there is a standard way to do this.
Say I’m doing text classification. Here are two ways I’ve tried, both of which worked for some projects.
First way:
import torch
from sklearn.model_selection import train_test_split
# Use 90% for training and 10% for validation.
# x is my input (numpy.ndarray), y is my label (numpy.ndarray).
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.1)
# Convert all inputs and labels into torch tensors, the required datatype for our model.
x_train = torch.tensor(x_train)
x_test = torch.tensor(x_test)
y_train = torch.tensor(y_train)
y_test = torch.tensor(y_test)
from torch.utils.data import DataLoader, TensorDataset
batch_size = 32
# Create the DataLoader for our training set.
train_data = TensorDataset(x_train, y_train)
train_dataloader = DataLoader(train_data, batch_size=batch_size)
# Create the DataLoader for our validation set.
validation_data = TensorDataset(x_test, y_test)
validation_dataloader = DataLoader(validation_data, batch_size=batch_size)
Then during training we’ll do:
for step, batch in enumerate(train_dataloader):
    inputs = batch[0].cuda()
    labels = batch[1].cuda()
The other way I’ve used is:
from torch.utils.data import DataLoader, TensorDataset
# create tensor dataset
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))
batch_size = 32
# shuffle data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, drop_last=True)
for inputs, labels in train_loader:
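To make the difference between the two ways concrete, here is a minimal sketch on toy data. The array shapes, the random data, and the CPU fallback are assumptions for illustration only, not part of my real project. It shows two things that differ between the snippets above: torch.tensor copies the NumPy array while torch.from_numpy shares its memory, and shuffle=True with drop_last=True changes both the order and the number of batches.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real data (assumed shapes): 100 samples, 8 features, 2 classes.
rng = np.random.default_rng(42)
x = rng.standard_normal((100, 8)).astype(np.float32)
y = rng.integers(0, 2, size=100)

# torch.tensor copies the array; torch.from_numpy shares memory with it.
x_copy = torch.tensor(x)
x_view = torch.from_numpy(x)
x[0, 0] = 123.0  # modify the NumPy array in place
copy_changed = x_copy[0, 0].item() == 123.0  # the copy is unaffected
view_changed = x_view[0, 0].item() == 123.0  # the shared view sees the change

dataset = TensorDataset(x_view, torch.from_numpy(y))
batch_size = 32

# First way: fixed order, last partial batch kept.
loader_a = DataLoader(dataset, batch_size=batch_size)
# Second way: reshuffled every epoch, last partial batch dropped.
loader_b = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

n_batches_a = len(loader_a)  # 4 batches: 32 + 32 + 32 + 4 samples
n_batches_b = len(loader_b)  # 3 batches: the final 4 samples are skipped

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seen = 0
for inputs, labels in loader_b:
    inputs, labels = inputs.to(device), labels.to(device)
    seen += inputs.shape[0]  # count samples actually visited in one epoch
```

With drop_last=True, each epoch visits only 96 of the 100 toy samples, which may be one reason the two ways behave differently during training and evaluation.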
I wonder if there is a document describing a standard way to do this in PyTorch. More specifically, is there a standard way to turn my input (say, a list of embedded sentences) into batches that a PyTorch neural network accepts?