DataLoader - How to load the data in sequence

Ladies and gentlemen, I'm new to the world of ML and it's great fun; however, I'm slowly but surely going crazy trying to solve this.

I'll do my best to explain my problem. I think the DataLoader is the issue, or it's a very easy fix and I just can't see the forest for the trees.

To any kind soul out there who can help: many thanks in advance!

I have used this as a reference for my model:

Everything is working well, but when I try to reuse the DataLoader and the inference code on my time-series data for predictions, I'm just not getting the results I need. The dataset is a time series where batches are linked to a Re-index. Let's say I have 5 batches / indexes in my data and I want a prediction made for each one in sequence. Since the DataLoader pulls the index via __getitem__, which in turn picks an index between 1 and len of the data, a DataLoader set to shuffle can give me predictions for indexes 2, 2, 3, 2, 5 = 3 distinct predictions, when I want 1, 2, 3, 4, 5 = 5 predictions.

Giving __getitem__ a fixed index and turning off shuffle, I get 5 predictions of one and the same index.

Surely this is not the right way to do it, but I have tried iterating inside __getitem__ to achieve the above, and nothing is working. I also saw some documentation on IterableDataset with start / end, but I couldn't get that to work either (I'm just too inexperienced at the moment to get my head around it).
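For reference, this is roughly the direction I was trying, based on the start / end example in the IterableDataset docs (simplified by me; I couldn't figure out how to plug my CSV data into it):

from torch.utils.data import IterableDataset, DataLoader

class MyIterableDataset(IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        self.start = start
        self.end = end

    def __iter__(self):
        # single-worker case only: just yield the values start .. end-1 in order
        return iter(range(self.start, self.end))

loader = DataLoader(MyIterableDataset(start=1, end=6), batch_size=1)
for idx in loader:
    print(idx)   # tensor([1]), tensor([2]), ..., tensor([5])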

Please help so I can continue my project.

I'm not sure I understand the use case completely, but that's not the case. By default (unless you are creating your own DataLoader), the sampler creates the batch indices, and the DataLoader grabs these indices and passes them to Dataset.__getitem__. The __getitem__ method is not creating the indices, but consuming them.
Could you describe the issue a bit more, in particular where the indices are coming from?
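As a minimal sketch of this index flow (a toy dataset just for illustration, not your code):

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __len__(self):
        return 5

    def __getitem__(self, idx):
        # idx is handed to us by the DataLoader's sampler
        return idx

# shuffle=False -> SequentialSampler -> indices 0, 1, 2, 3, 4 in order
# shuffle=True  -> RandomSampler     -> a permutation of 0..4 (no repeats within an epoch)
for idx in DataLoader(ToyDataset(), batch_size=1, shuffle=False):
    print(idx)   # tensor([0]), tensor([1]), ...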

I'm probably not explaining it very well, sorry about that.

I'm using a CSV file with data that looks something like this:

Date        Name  Re-index  Data1  Data2  Data3  Data4
2021-10-13  A     1         x      x      x      x
2021-10-14  A     1         x      x      x      x
2021-10-15  A     1         x      x      x      x
2021-10-13  B     2         x      x      x      x
2021-10-14  B     2         x      x      x      x
2021-10-15  B     2         x      x      x      x
2021-10-13  C     3         x      x      x      x
2021-10-14  C     3         x      x      x      x
2021-10-15  C     3         x      x      x      x

I want a prediction for every Re-index number, in sequence. But I can only get the model to give me either random predictions x times, or x predictions of one specific Re-index.

The CSV data goes into the DataLoader, and as I understand it, __getitem__ takes one of the Re-index batches and sends it to the model. For predictions I use the inference part, where I can specify how many times it should iterate a prediction. When I run the DataLoader with shuffle=True it gives random Re-index predictions x times; when shuffle=False it gives a prediction on the max Re-index x times; and when I fix a Re-index in __getitem__ (idx=2) I get a prediction on that index x times.

In short, over range(3) it gives me predictions for Re-index values of either:
DataLoader shuffle = True: 2, 1, 2
DataLoader shuffle = False (len / max): 3, 3, 3
DataLoader shuffle = False, idx fixed: 2, 2, 2
What I want: 1, 2, 3

Thanks for helping out, much appreciated!

Thanks for the update. If I understand your use case correctly, you want to use a specific sampling strategy where each batch should contain a sample for each Re-index, or at least Re-index values should not be repeated within a single batch.
In that case you could create a custom sampler and implement the sampling logic there to make sure each batch contains different Re-index values.
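A sketch of what such a sampler could look like (the class name is a placeholder; the ordering here is simply sequential, which is equivalent to what shuffle=False / SequentialSampler already gives you, but the same pattern lets you emit the indices in any order or grouping you want):

from torch.utils.data import Sampler, DataLoader

class SequentialReindexSampler(Sampler):
    """Yields dataset indices 0 .. len-1 in order, i.e. Re-index 1, 2, 3, ... in sequence."""
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # put any custom ordering / grouping logic here
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

# dataset = SensorDataset(...)
# loader = DataLoader(dataset, batch_size=1, sampler=SequentialReindexSampler(dataset))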

No, thank you for helping out!
Correct, and the model is actually doing this at the moment, but it's just not doing it in sequence; it's either random or static. I have tried to figure out how to change the DataLoader to do this, but I'm too green to make it happen; I don't know how to make the __getitem__ idx go in sequence. If you can point me in the right direction with a bit of guidance, that would be awesome! Below is the full code for the dataset.

import pandas as pd
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler
import os
import torch
import numpy as np
import random
import matplotlib.pyplot as plt
from joblib import dump
from icecream import ic

class SensorDataset(Dataset):
    """Sensor time-series dataset."""

    def __init__(self, csv_name, root_dir, training_length, forecast_window):
        """
        Args:
            csv_name (string): Name of the csv file.
            root_dir (string): Directory containing the csv file.
            training_length (int): Number of input time steps (T).
            forecast_window (int): Number of time steps to forecast (S).
        """

        # load raw data file
        csv_file = os.path.join(root_dir, csv_name)
        self.df = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = MinMaxScaler()
        self.T = training_length
        self.S = forecast_window

    def __len__(self):
        # return number of sensors (one group per reindexed_id)
        return len(self.df.groupby(by=["reindexed_id"]))

    # The DataLoader's sampler passes an index between 0 and __len__() - 1.
    def __getitem__(self, idx):

        # Sensors are indexed from 1
        idx = idx + 1

        # np.random.seed(0)

        start = np.random.randint(0, len(self.df[self.df["reindexed_id"] == idx]) - self.T - self.S)
        sensor_number = str(self.df[self.df["reindexed_id"] == idx][["sensor_id"]][start : start + 1].values.item())
        index_in = torch.tensor([i for i in range(start, start + self.T)])
        index_tar = torch.tensor([i for i in range(start + self.T, start + self.T + self.S)])
        _input = torch.tensor(self.df[self.df["reindexed_id"] == idx][["humidity", "sin_hour", "cos_hour", "sin_day", "cos_day", "sin_month", "cos_month"]][start : start + self.T].values)
        target = torch.tensor(self.df[self.df["reindexed_id"] == idx][["humidity", "sin_hour", "cos_hour", "sin_day", "cos_day", "sin_month", "cos_month"]][start + self.T : start + self.T + self.S].values)

        # the scaler is fit only to the input, to avoid the scaled values "leaking" information about the target range.
        # the scaler is fit only for humidity, as the timestamps are already scaled.
        # scaler input/output shape: [n_samples, n_features].
        scaler = self.transform

        scaler.fit(_input[:, 0].unsqueeze(-1))
        _input[:, 0] = torch.tensor(scaler.transform(_input[:, 0].unsqueeze(-1)).squeeze(-1))
        target[:, 0] = torch.tensor(scaler.transform(target[:, 0].unsqueeze(-1)).squeeze(-1))

        # save the scaler to be used later when inverse-transforming the data for plotting.
        dump(scaler, 'scalar_item.joblib')

        return index_in, index_tar, _input, target, sensor_number
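And this is roughly how I create the DataLoader from it (simplified; the file name and the training_length / forecast_window values below are just placeholders, not my real settings):

dataset = SensorDataset(
    csv_name="sensor_data.csv",   # placeholder name
    root_dir="Data",              # placeholder directory
    training_length=48,
    forecast_window=24,
)

# shuffle=True gives a random Re-index order; shuffle=False should visit idx 0, 1, 2, ... in order
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)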

Take a look at this post to check how the BatchSampler could work (it’ll pass a batch of indices to the __getitem__ method). To do so, you could create a custom sampling strategy in a derived sampler class and create the batch of indices here.
Let me know, if you get stuck and I can try to follow up with an example code snippet.
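As a rough sketch of the index flow only (a toy dataset so you can see which indices __getitem__ receives; your SensorDataset.__getitem__ would need to be adapted to accept a list of indices):

import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, SequentialSampler

class ToyDataset(Dataset):
    def __len__(self):
        return 3

    def __getitem__(self, indices):
        # with the setup below, `indices` is a list of indices, not a single int
        print("__getitem__ received:", indices)
        return torch.tensor(indices)

dataset = ToyDataset()

# Passing a BatchSampler as `sampler` with batch_size=None disables automatic batching,
# so each list of indices produced by the sampler is handed to __getitem__ as-is.
# SequentialSampler keeps the order 0, 1, 2 -> Re-index 1, 2, 3.
loader = DataLoader(
    dataset,
    sampler=BatchSampler(SequentialSampler(dataset), batch_size=1, drop_last=False),
    batch_size=None,
)

for batch in loader:
    print(batch)   # tensor([0]), tensor([1]), tensor([2])

Replacing SequentialSampler with your own sampler class is where the custom ordering logic would go.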

Thank you very much for your help. I'll check it out!