Stack frames for DNN input

I have a dataset of N audio frames with shape (N, 13) for the task of phoneme recognition:

train_data = torch.hstack((train_feat, train_labels))
train_loader = torch.utils.data.DataLoader(train_data, batch_size=128, shuffle=True)
torch.Size([3082092, 14])

How can I stack 7 frames each time to feed to the DNN?

I’m not sure which dimension refers to the “frame dimension”, but you could probably use either torch.stack or torch.cat to create the new stacked/concatenated tensor.
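To make the difference between the two options concrete, here is a minimal sketch. The seven frames are random stand-in data, and torch.cat is assumed to be the concatenating counterpart to torch.stack mentioned above.

```python
import torch

# Seven hypothetical frames, each with 13 features (random stand-in data).
frames = [torch.randn(13) for _ in range(7)]

# torch.stack adds a new leading dimension: one row per frame.
stacked = torch.stack(frames)

# torch.cat joins along the existing dimension 0: one long feature vector.
concatenated = torch.cat(frames)

print(stacked.shape)       # torch.Size([7, 13])
print(concatenated.shape)  # torch.Size([91])
```

Which of the two is appropriate depends on whether the model expects a (7, 13) input or a flat vector of 7 * 13 = 91 values.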

Hi @ptrblck. The x_train dataset has 3082092 frames. Each frame has 13 numbers (features).
The y_train has 3082092 digits (labels).
That is, for each frame (1, 13) there is one label.
Now, feeding one frame to the DNN is not going to work because it contains too little information. Instead, I would like to stack a sequence of 7 frames (out of those 3082092). I hope that makes sense.

Thanks for the explanation. The 7 frames would thus correspond to the batch size and you could set it in the DataLoader.
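As a rough illustration of that suggestion, the sketch below uses a small random tensor in place of the real train_data (70 rows instead of 3082092) and assumes the label sits in the last column, as produced by the hstack call in the question.

```python
import torch
from torch.utils.data import DataLoader

# Small random stand-in for train_data: 13 feature columns plus 1 label column.
train_data = torch.randn(70, 14)

# With batch_size=7, each batch holds 7 frames.
loader = DataLoader(train_data, batch_size=7, shuffle=False)
batch = next(iter(loader))

# Split each batch back into features and labels.
features, labels = batch[:, :13], batch[:, 13]
print(features.shape)  # torch.Size([7, 13])
print(labels.shape)    # torch.Size([7])
```

Note that a model receiving this batch still treats each of the 7 rows as a separate sample with its own label.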

In that case, the DNN would look at them as 7 individual units, but I wanted to stack 7 frames as one unit.

Could you explain how one “unit” would be processed in the model and what the expected input shape would thus be?
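One possible reading of the request, sketched under assumptions the thread does not confirm: each "unit" is a sliding window of 7 consecutive frames flattened into a single vector of 7 * 13 = 91 features, and the label of a unit is the label of its centre frame. The tensors below are random stand-ins for train_feat and train_labels, and the 40 classes are an arbitrary placeholder.

```python
import torch

context = 7                             # frames per stacked unit (from the question)
feats = torch.randn(100, 13)            # stand-in for train_feat
labels = torch.randint(0, 40, (100,))   # stand-in labels; 40 classes is arbitrary

# Tensor.unfold(dimension, size, step) slides a window along dim 0.
# Result shape: (N - context + 1, 13, context) = (94, 13, 7)
windows = feats.unfold(0, context, 1)

# Reorder to (94, 7, 13) so each window is frames x features,
# then flatten each window into one vector of 91 values.
stacked = windows.permute(0, 2, 1).reshape(-1, context * 13)

# Assumption: each window takes the label of its centre frame.
half = context // 2
centre_labels = labels[half : len(labels) - half]

print(stacked.shape)        # torch.Size([94, 91])
print(centre_labels.shape)  # torch.Size([94])
```

Under this reading the first linear layer of the DNN would take 91 input features. The first and last 3 frames get no window of their own here; padding the sequence would be one way to keep them, and windows should not cross utterance boundaries if the frames come from several recordings.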