DataLoader / how to divide a dataset into training and test sets

Hi everyone

I am working with 1D signal data of shape (65536, 94), where 65536 is the number of samples and 94 is the length of each sample (each signal has a length of 94). I would like to divide my dataset into training and testing sets.
I actually have no idea how to do this with PyTorch's DataLoader class.
Can anyone help me with this?

Thank you

Hello,

Usually, the split into training and testing data is done before using PyTorch's DataLoader class, since the class takes a dataset as a parameter. What you could do is separate your 65536 x 94 tensor into two tensors, one for training and the other for testing (my rule of thumb is to keep around 20% for testing). Then, you could use the PyTorch Dataset class to create your own custom dataset, which would then be used in the DataLoader. Here is a small example:

import torch

class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        # return one signal of length 94
        return self.data[index]

    def __len__(self):
        # number of samples in the dataset
        return len(self.data)

x = torch.rand(size=(65536, 94), dtype=torch.float32)

# keep the first 80% of the samples for training, the remaining 20% for testing
separation = int(x.shape[0] * 0.8)
train = x[:separation]
test = x[separation:]

train_dataset = CustomDataset(train)
test_dataset = CustomDataset(test)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32)

for train_sample in train_loader:
    print(train_sample.shape)
    # torch.Size([64, 94])

for test_sample in test_loader:
    print(test_sample.shape)
    # torch.Size([32, 94])
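As a side note, if you want the samples to be assigned to the two sets at random rather than by position, torch.utils.data.random_split can handle the partitioning for you (a minimal sketch building on the CustomDataset above; the 80/20 ratio is just the same rule of thumb):

import torch

x = torch.rand(size=(65536, 94), dtype=torch.float32)
dataset = CustomDataset(x)

# random_split shuffles the indices internally before splitting
n_train = int(len(dataset) * 0.8)
train_dataset, test_dataset = torch.utils.data.random_split(
    dataset, [n_train, len(dataset) - n_train]
)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32)

Both subsets returned by random_split can be passed straight to a DataLoader, since they still implement __getitem__ and __len__.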

Hope this helps!


Hi Beaupreda

Thank you for your reply. It actually worked :smile: I just couldn't understand how the code below works:
separation = int(x.shape[0] * 0.8)
train = x[:separation]
test = x[separation:]

What exactly is x[separation:] doing?

And one more question: why didn't you set shuffle=True? Was there any specific reason for that?

Thank you again,
Pedram

x[separation:] is slicing an array (or a tensor, in this case). Basically, separation is an index equal to 80% of the dataset length, since we want to keep 80% of the data for training and the remaining 20% for testing.

x[:separation] takes all the elements of the tensor up to separation (so 0, 1, 2, 3, …, separation - 1)
x[separation:] takes all the elements of the tensor from separation to the length of the tensor (so separation, separation + 1, …, 65534, 65535)

Here is a small example with a 5 x 5 tensor:

x = torch.rand(size=(5, 5), dtype=torch.float32)
print(x)
# tensor([[0.0345, 0.3828, 0.2489, 0.4129, 0.4522],
#         [0.0787, 0.7049, 0.2124, 0.2115, 0.1857],
#         [0.6836, 0.7091, 0.2063, 0.1679, 0.3338],
#         [0.7525, 0.1769, 0.1104, 0.0380, 0.6871],
#         [0.9377, 0.6564, 0.2296, 0.5100, 0.7274]])
separation = 3

# first 3 rows
y = x[:separation]
print(y)
# tensor([[0.0345, 0.3828, 0.2489, 0.4129, 0.4522],
#         [0.0787, 0.7049, 0.2124, 0.2115, 0.1857],
#         [0.6836, 0.7091, 0.2063, 0.1679, 0.3338]])

# last 2 rows
z = x[separation:]
print(z)
# tensor([[0.7525, 0.1769, 0.1104, 0.0380, 0.6871],
#         [0.9377, 0.6564, 0.2296, 0.5100, 0.7274]])

# rows 1, 2, 3
d = x[1:4]
print(d)
# tensor([[0.0787, 0.7049, 0.2124, 0.2115, 0.1857],
#         [0.6836, 0.7091, 0.2063, 0.1679, 0.3338],
#         [0.7525, 0.1769, 0.1104, 0.0380, 0.6871]])

No specific reason, except that I forgot!
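For what it's worth, enabling it only changes the loader construction (a minimal sketch, reusing train_dataset and test_dataset from the example above):

# reshuffle the training samples at the start of every epoch;
# the test loader is usually left in a fixed order
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)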


Great.
I appreciate your help…