Correctly feeding LSTM with minibatch time sequence data

Hi,

I’m having trouble setting the correct tensor sizes for my research. I have about 400000 data points in the form (time, value), stored in a csv file. I would like to feed my LSTM in mini-batches of 20 sequences of length 100 each. I’m not sure how to do that properly. Any advice is appreciated.


From the docs:

input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. …

How many features does each sample contain?
Assuming it’s just a single feature, each batch should have the shape [100, 20, 1] using the default setup.
However, if you specify batch_first=True, you would need to pass your data as [20, 100, 1].
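
For illustration, a minimal sketch of both layouts (hidden_size=8 is an arbitrary choice):

import torch
import torch.nn as nn

# default layout: input of shape [seq_len, batch, input_size]
lstm = nn.LSTM(input_size=1, hidden_size=8)
x = torch.randn(100, 20, 1)   # 100 time steps, 20 sequences, 1 feature
out, (h, c) = lstm(x)
print(out.shape)              # torch.Size([100, 20, 8])

# batch_first=True layout: input of shape [batch, seq_len, input_size]
lstm_bf = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)
out_bf, _ = lstm_bf(torch.randn(20, 100, 1))
print(out_bf.shape)           # torch.Size([20, 100, 8])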


Each sample contains only 1 feature, paired with 1 label. The data is a time series. I’ve been reading the data using pandas and have attempted to use Dataset and DataLoader. When using the DataLoader, would obtaining the correct tensor be as simple as specifying batch_size=20? It seems as though this would only load a [20, 1] tensor.

If you load a single sample in your Dataset's __getitem__ method in the shape [seq_len, features], your DataLoader should return [batch_size, seq_len, features] using the default collate_fn.

In the training loop you could permute the dimensions to match [seq_len, batch_size, features] or just use batch_first=True in your LSTM.
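
For example, a minimal sketch of the permute approach (the random tensor just stands in for a real batch):

# batch from the DataLoader: [batch_size, seq_len, features] = [20, 100, 1]
batch = torch.randn(20, 100, 1)

# rearrange to the default LSTM layout [seq_len, batch_size, features]
batch = batch.permute(1, 0, 2)   # -> [100, 20, 1]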


First of all, thanks for the replies and advice! If I’m understanding correctly, my Dataset class should look like this:

class OUDataset(Dataset):
  def __init__(self, csv_file):
    self.oudataframe = pd.read_csv(csv_file)
    self.seq_len = 100

  def __len__(self):
    return len(self.oudataframe)

  def __getitem__(self, idx):
    dy = self.oudataframe.iloc[idx, 0:]
    return self.seq_len, dy

However, I get an error when iterating over the DataLoader:

TypeError: batch must contain tensors, numbers, dicts or lists; found <class ‘pandas.core.series.Series’>

My call to DataLoader looks like:

dataSet = OUDataset(csv_file=filelocationhere)
dataloader = DataLoader(dataSet, batch_size=20, shuffle=True)

Any idea what might be going on?

Edit:
I changed the __getitem__ method to:

dy = torch.tensor(self.oudataframe.iloc[idx, 0:])

and no longer get the same error. How can I verify that my tensors are the correct shape?
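
As an aside, a pandas row usually comes back as float64; a minimal sketch of an explicit conversion, assuming the model keeps its default float32 parameters:

dy = torch.tensor(self.oudataframe.iloc[idx, 0:].values, dtype=torch.float32)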

You could check it via:

data, target = next(iter(dataloader))
print(data.shape)
print(target.shape)

So, in the training loop I can loop for each epoch and, instead of enumerating over the dataloader, just use data, target = next(iter(dataloader)) and then make my calls to the model, optimizer, and loss?

You could create an iterator = iter(dataloader) and then call data, target = next(iterator) inside your training loop, but you would need to catch the StopIteration manually (a sketch follows the loop below), so I would just stick to:

for data, target in dataloader:
    ...
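
For completeness, a minimal sketch of the manual-iterator variant and the StopIteration handling it needs:

iterator = iter(dataloader)
while True:
    try:
        data, target = next(iterator)
    except StopIteration:
        break   # epoch finished; recreate the iterator for the next epoch
    # ... forward pass, loss, backward, optimizer step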

I’ve just used the data, target = next(iter(dataloader)) line to get a single batch and check the shapes.


Thanks. I must be doing something incorrectly. With code that looks like this:

dataloader = DataLoader(dataSet, batch_size=20, shuffle=True)

for t, dy in enumerate(dataloader):
  print(t)
  print(dy)
  print(dy[1])
  break

I get:

Testing dataset class …
Moving to dataloader section …
0
[tensor([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
100, 100, 100, 100, 100, 100]), tensor([[ 1.7786e+04, 7.9729e-01],
[ 1.2340e+04, -1.6030e-01],
[ 9.1269e+03, -7.1534e-02],
[ 2.9582e+04, -7.2178e-01],
[ 1.2473e+04, -3.4016e-01],
[ 1.1071e+04, 2.3349e-01],
[ 3.2279e+04, -1.3710e+00],
[ 1.4720e+03, 2.4033e-01],
[ 2.0018e+04, -2.2206e-01],
[ 1.4384e+04, -7.4043e-01],
[ 3.5270e+02, -3.3346e-01],
[ 1.1394e+04, 3.3650e-01],
[ 2.2026e+04, -4.4804e-02],
[ 2.3803e+04, 2.2828e-01],
[ 1.6722e+04, 1.6180e-01],
[ 2.7046e+04, -6.7901e-01],
[ 2.9860e+04, 1.5923e-01],
[ 2.4130e+04, -7.0021e-01],
[ 1.1229e+04, 7.4529e-01],
[ 3.6677e+04, -5.1128e-01]])]
tensor([[ 1.7786e+04, 7.9729e-01],
[ 1.2340e+04, -1.6030e-01],
[ 9.1269e+03, -7.1534e-02],
[ 2.9582e+04, -7.2178e-01],
[ 1.2473e+04, -3.4016e-01],
[ 1.1071e+04, 2.3349e-01],
[ 3.2279e+04, -1.3710e+00],
[ 1.4720e+03, 2.4033e-01],
[ 2.0018e+04, -2.2206e-01],
[ 1.4384e+04, -7.4043e-01],
[ 3.5270e+02, -3.3346e-01],
[ 1.1394e+04, 3.3650e-01],
[ 2.2026e+04, -4.4804e-02],
[ 2.3803e+04, 2.2828e-01],
[ 1.6722e+04, 1.6180e-01],
[ 2.7046e+04, -6.7901e-01],
[ 2.9860e+04, 1.5923e-01],
[ 2.4130e+04, -7.0021e-01],
[ 1.1229e+04, 7.4529e-01],
[ 3.6677e+04, -5.1128e-01]])

This seems incorrect: while I seem to have 20 batches of 100, when printing dy (features, labels) I only seem to have 20 different points in the sequence, where I was expecting 100. What am I doing incorrectly? I don’t mind sharing the complete source if necessary.

Thanks in advance!

There might be a misunderstanding in your earlier post that I missed.
When I said your DataLoader would return a batch of shape [batch_size, seq_len, features], I was referring to the dimensions, e.g. [20, 100, 2].

In your custom Dataset you are directly returning the value of self.seq_len.
Instead you should create a data sample of the shape [seq_len, features].
I was assuming this format was already stored in your oudataframe.
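
A minimal sketch of one way to build such a sample inside the Dataset (assuming seq_len consecutive rows form one window):

def __getitem__(self, idx):
    # slice seq_len consecutive rows -> shape [seq_len, features]
    window = self.oudataframe.iloc[idx:idx + self.seq_len].values
    return torch.tensor(window, dtype=torch.float32)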


So, my __getitem__ method in the custom Dataset should be:

def __getitem__(self, idx):
  data = torch.zeros(100, 2)
  for i in range(0, 100):
    data[i] = torch.tensor(self.oudataframe.iloc[idx, 0:])
  return data

This would create a tensor with 100 identical rows, as you are indexing the dataframe with [idx, 0:], which doesn’t change inside the loop.


Ah, thanks for catching that silly error. It should be idx+i, then. Doing so should hopefully give me the correct shape. I must have a fundamental misunderstanding of what’s actually happening here. After making the change in the __getitem__ method of the custom dataset to:

def __getitem__(self, idx):
  data = torch.zeros(100, 2)
  for i in range(0, 100):
    data[i] = torch.tensor(self.oudataframe.iloc[idx + i, 0:])
  return data

and subsequently attempting to check the shape via:

dataloader = DataLoader(dataSet, batch_size=20, shuffle=False)
data, target = next(iter(dataloader))
print(data.shape)
print(target.shape)

I get:

data, target = next(iter(dataloader))
ValueError: too many values to unpack (expected 2)

It seems that I am not actually returning what I think I am, or my syntax is incorrect.

Edit:

After setting a test variable to dataSet[0], the shape of the test variable is indeed [100,2].
However, I don’t seem to be understanding how to iterate properly through the newly created dataloader object. Any insights?

You just forgot to return the target tensor, as you are currently using return data.
Either return both as return data, target, or just assign data = next(iter(dataloader)).

Note that you are currently returning overlapping “windows” in your approach, and you should take care not to index out of bounds (as you index up to idx+100).
You could fix this by subtracting 100 from the overall length returned in def __len__(self).
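
A minimal sketch of the bounds fix (for windows of length 100):

def __len__(self):
    # the last valid window starts 100 rows before the end of the frame
    return len(self.oudataframe) - 100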


So, from what I seem to be understanding, I should have a separate target tensor (i.e. the labels of each data point). Should I separate the (time, dy) components from the csv in the __init__ method of the dataset?
Edit:
The first few lines of my csv are:

0.0, 1.0
0.1, 1.15478042212
0.2, 1.31513637888
0.30000000000000004, 1.12594414055
0.4, 0.688184511979
0.5, 0.249467795528
0.6000000000000001, 0.140746339238
0.7000000000000001, 0.0152185819711
0.8, 0.260358790336
0.9, -0.0533512260524

where column 0 is a step in time and column 1 is the corresponding value. I took this to mean that t is my data (feature) and dy is my target (label). Am I thinking about it incorrectly?

So your data should be the linearly increasing time steps, while your target is the (at first sight) random data?

I would split the data and target in __getitem__ and just return both values, as this makes the training loop cleaner in my opinion.
You could do it in __init__ or __getitem__ as it shouldn’t really matter.
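
For example, a minimal sketch of splitting inside __getitem__ (column order taken from the csv excerpt above; slicing the whole window at once also avoids the Python loop):

def __getitem__(self, idx):
    window = self.oudataframe.iloc[idx:idx + 100].values         # [100, 2]
    data = torch.tensor(window[:, 0:1], dtype=torch.float32)     # time column, [100, 1]
    target = torch.tensor(window[:, 1:2], dtype=torch.float32)   # value column, [100, 1]
    return data, target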


Yes, my data is just linearly increasing time steps, while the targets were generated from a predefined function with Gaussian white noise added; it’s a 1D model of an Ornstein-Uhlenbeck process. So I should split the data, ‘stack’ both columns as I have in the loop, and return the ordered pair as a [100, 1], [100, 1] tensor pair?

Yes, that sounds right.
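
Putting the shape discussion together, a quick sanity check on one batch might look like this (assuming the split __getitem__ sketched above):

data, target = next(iter(dataloader))
print(data.shape, target.shape)   # torch.Size([20, 100, 1]) each with batch_size=20

lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)
out, _ = lstm(data)               # batch_first matches the [batch, seq_len, features] layout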
Let me know how the experiments went! 🙂


Will do. In the training loop I believe I should be able to do:

for data, target in enumerate(dataloader):
  nn.LSTM(data)

And then use the targets for my loss function… I think. With a data set of 400000 points in this structure, I believe there should be around 2000 batches to process all of the data. This also means that after I call DataLoader, len(dataloader) should return 2000?
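
For reference, the DataLoader length follows directly from the Dataset length; a quick check, assuming exactly 400000 rows and the overlapping-window __len__ from the Dataset posted below:

# len(dataloader) == ceil(len(dataset) / batch_size)
# len(dataset) = 400000 - 100 = 399900, batch_size = 20
print(len(dataloader))   # 19995 overlapping-window batches per epoch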

This has been pretty odd. To actually access the correct data (the t variable), I’ve had to use only the target: from for data, target in enumerate(dataloader), data returns an integer, while target returns a tensor of size [100, 2]. Is this something with how pandas handles things? The dataframe only has 2 indices, because trying to pull the target values specifically within the dataframe results in an out-of-bounds error.

Just for completeness here… I’m going to post my Dataset class to avoid confusion.

# Data set formatting for easier input and processing by network.
class OUDataset(Dataset):

  def __init__(self, csv_file):
    self.oudataframe = pd.read_csv(csv_file)

  def __len__(self):
    return len(self.oudataframe) - 100
  
  def __getitem__(self, idx):
    # build one overlapping window of 100 consecutive rows
    data = torch.zeros(100, 1)
    target = torch.zeros(100, 1)
    for i in range(0, 100):
      data[i] = torch.tensor(self.oudataframe.iloc[idx + i][0])    # time step
      target[i] = torch.tensor(self.oudataframe.iloc[idx + i][1])  # observed value
    return [data, target]

######################################################################################

######################################################################################
# Function/Method definitions
######################################################################################

def main():
  k = 0 # iteration 0.
  print("Running!")
  print("Testing dataset class ...")
  dataSet = OUDataset(csv_file="../../trainingData/python/data/trainingDataOrnsteinUhlenbeck.dat")

  print("Moving to dataloader section ...")
  counter = 0 
  
  dataloader = DataLoader(dataSet, batch_size=20, shuffle=False)
  for data, target in enumerate(dataloader):
    print(data)
    print(target[0])
    print(target[1])
    break
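
As noted earlier in the thread, enumerate(dataloader) yields (batch_index, batch) pairs, which is why data comes back as an integer here; a minimal sketch of the plain loop instead:

for data, target in dataloader:
  print(data.shape, target.shape)   # torch.Size([20, 100, 1]) each with this Dataset
  break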