Batch sample from the dataset

Hi all!!
I am new to PyTorch. My task is to train a model using batch samples from the dataset. I am not allowed to use loops to collect samples into a batch; I can only iterate over batches of the dataset. So my question is: how do I create these batches from the dataset under the restrictions mentioned above?

This is how I did it before, where get_batch is a helper function that collects samples into a batch, and I want to replace it.

I wrote this piece of code for another topic. It does use a loop to collect a batch from a dataset, so I'm not sure if it will be useful to you. If not, have you taken a look at PyTorch's BatchSampler?
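A minimal sketch of the BatchSampler approach, using fake data of the same shape as the snippet below (sizes and names here are just placeholders): BatchSampler wraps a sampler and yields lists of indices, so no hand-written collection loop is needed.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, BatchSampler

# Hypothetical stand-in data: 100 fake 3x2x2 "images" with dummy labels.
images = torch.randn(100, 3, 2, 2)
labels = torch.ones(100, 1)
dataset = TensorDataset(images, labels)

# RandomSampler reshuffles the indices each epoch; BatchSampler groups
# them into lists of batch_size -- the DataLoader does the collecting.
batch_sampler = BatchSampler(RandomSampler(dataset), batch_size=10, drop_last=False)
loader = DataLoader(dataset, batch_sampler=batch_sampler)

for batch_images, batch_labels in loader:
    # batch_images has shape (10, 3, 2, 2), batch_labels has shape (10, 1)
    pass
```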

Edit: What data structures are X_train & y_train? NumPy arrays? Tensors?

import random

import torch
import more_itertools
from torch.utils.data import TensorDataset

def load_data():
  # Fake data. You can also load your images and convert them into tensors.
  number_images = 100
  images = torch.randn(number_images, 3, 2, 2)
  labels = torch.ones(number_images, 1)
  return TensorDataset(images, labels)

def get_batch(dataset, batch_idx):
  ''' Returns the data items for the given batch indexes '''

  # Set up the datastructures
  im_size = dataset[0][0].size()
  batch_size = len(batch_idx)
  batch_data = torch.empty((batch_size, *im_size))
  batch_labels = torch.empty((batch_size, 1))
  # Add data to datastructures
  for i, data_idx in enumerate(batch_idx):
    data, label = dataset[data_idx]
    batch_data[i] = data
    batch_labels[i] = label

  return batch_data, batch_labels

dataset = load_data()
data_length = len(dataset)

batch_size = 10
n_epochs = 10
for epoch in range(n_epochs):
  # Create indexes, shuffle them, and split them into batches
  indexes = list(range(data_length))
  random.shuffle(indexes)
  indexes = more_itertools.chunked(indexes, batch_size)

  for batch_idx in indexes:
    images, labels = get_batch(dataset, batch_idx)
    # You can now work with your data
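For comparison, the epoch loop above can be written without any manual batching at all by letting a plain DataLoader do the collecting (shuffle=True reshuffles the indices at the start of every epoch). The data sizes below just mirror the fake data from load_data():

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Same fake data as load_data() above: 100 3x2x2 "images" with dummy labels.
dataset = TensorDataset(torch.randn(100, 3, 2, 2), torch.ones(100, 1))

# DataLoader collects samples into batches internally, replacing both
# get_batch() and the more_itertools.chunked() index splitting.
loader = DataLoader(dataset, batch_size=10, shuffle=True)

n_epochs = 10
for epoch in range(n_epochs):
    for images, labels in loader:
        # images: (10, 3, 2, 2), labels: (10, 1)
        pass
```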