DataLoader / how to divide a dataset into training and test sets

Hi everyone

I am working with 1D signal data of shape (65536, 94), where 65536 is the number of samples and 94 is the length of each sample (each signal has a length of 94). I would like to divide my dataset into training and testing sets.
I actually have no idea how to do this with PyTorch's DataLoader class.
Can anyone help me with this?

Thank you

Hello,

Usually, the split into training and testing data is done before using PyTorch's DataLoader class, since the class takes a dataset as a parameter. What you could do is separate your 65536 x 94 tensor into two tensors, one for training and the other for testing (my rule of thumb is to keep around 20% for testing). Then, you could use the PyTorch Dataset class to create your own custom dataset, which would then be used in the DataLoader. Here is a small example:

import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        # return one signal of length 94
        return self.data[index]

    def __len__(self):
        # number of samples in the dataset
        return len(self.data)

x = torch.rand(size=(65536, 94), dtype=torch.float32)

# keep the first 80% of the samples for training, the remaining 20% for testing
separation = int(x.shape[0] * 0.8)
train = x[:separation]
test = x[separation:]

train_dataset = CustomDataset(train)
test_dataset = CustomDataset(test)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32)

for train_sample in train_loader:
    print(train_sample.shape)
    # torch.Size([64, 94])

for test_sample in test_loader:
    print(test_sample.shape)
    # torch.Size([32, 94])
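As a side note, if you want the samples to be assigned to the two sets at random rather than by position, torch.utils.data.random_split can handle the partitioning for you (a minimal sketch building on the CustomDataset above; the 80/20 ratio is just the same rule of thumb):

import torch

x = torch.rand(size=(65536, 94), dtype=torch.float32)
dataset = CustomDataset(x)

# random_split shuffles the indices internally before splitting
n_train = int(len(dataset) * 0.8)
train_dataset, test_dataset = torch.utils.data.random_split(
    dataset, [n_train, len(dataset) - n_train]
)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32)

Both subsets returned by random_split can be passed straight to a DataLoader, since they still implement __getitem__ and __len__.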

Hope this helps!


Hi Beaupreda

Thank you for your reply. It actually worked :smile: I just couldn't understand how the code below works:
separation = int(x.shape[0] * 0.8)
train = x[:separation]
test = x[separation:]

What exactly is x[separation:] doing?

And one more question: why didn't you set shuffle=True? Was there any specific reason for that?

Thank you again,
Pedram

x[separation:] is slicing an array (or a tensor, in this case). Basically, separation is an index equal to 80% of the dataset length, since we want to keep 80% of the data for training and the remaining 20% for testing.

x[:separation] takes all the elements of the tensor up to separation (so 0, 1, 2, 3, …, separation - 1)
x[separation:] takes all the elements of the tensor from separation to the length of the tensor (so separation, separation + 1, …, 65534, 65535)

Here is a small example with a 5 x 5 tensor:

x = torch.rand(size=(5, 5), dtype=torch.float32)
print(x)
# tensor([[0.0345, 0.3828, 0.2489, 0.4129, 0.4522],
#         [0.0787, 0.7049, 0.2124, 0.2115, 0.1857],
#         [0.6836, 0.7091, 0.2063, 0.1679, 0.3338],
#         [0.7525, 0.1769, 0.1104, 0.0380, 0.6871],
#         [0.9377, 0.6564, 0.2296, 0.5100, 0.7274]])
separation = 3

# first 3 rows
y = x[:separation]
print(y)
# tensor([[0.0345, 0.3828, 0.2489, 0.4129, 0.4522],
#         [0.0787, 0.7049, 0.2124, 0.2115, 0.1857],
#         [0.6836, 0.7091, 0.2063, 0.1679, 0.3338]])

# last 2 rows
z = x[separation:]
print(z)
# tensor([[0.7525, 0.1769, 0.1104, 0.0380, 0.6871],
#         [0.9377, 0.6564, 0.2296, 0.5100, 0.7274]])

# rows 1, 2, 3
d = x[1:4]
print(d)
# tensor([[0.0787, 0.7049, 0.2124, 0.2115, 0.1857],
#         [0.6836, 0.7091, 0.2063, 0.1679, 0.3338],
#         [0.7525, 0.1769, 0.1104, 0.0380, 0.6871]])

No specific reason, except that I forgot!
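For what it's worth, enabling it only changes the loader construction (a minimal sketch, reusing train_dataset and test_dataset from the example above):

# reshuffle the training samples at the start of every epoch;
# the test loader is usually left in a fixed order
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)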


Great.
I appreciate your help…