Should I use an IterableDataset? The data is like a directory of videos, and the model would look at sequences within each video

So I am not actually working with video files but with a custom file type: video game replays. The files contain game states and player inputs, and I'm trying to build a model that predicts player inputs. I will need to do a fair amount of preprocessing, and I would like to test out a few different model types (RNN, LSTM, Transformer/attention). Since the data is sequential, I will need to take subsequences from within each file.

For now I am not implementing a Dataset or a DataLoader. I am following this tutorial, which doesn't use them either, and I'm afraid that will make the training much slower. I suppose I'll find out for sure when I get there, but in the meantime I would appreciate any help or insight on this.

https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
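
To make the question more concrete, this is roughly the kind of IterableDataset I'm picturing. It's only a sketch: the *.replay extension and the load_replay() parser are placeholders for my actual format, and the sliding window is just one way of taking subsequences from each file.

```python
from pathlib import Path

import torch
from torch.utils.data import IterableDataset, DataLoader


class ReplayDataset(IterableDataset):
    """Streams fixed-length (states, inputs) subsequences from replay files."""

    def __init__(self, replay_dir, seq_len, load_replay):
        # load_replay is a placeholder: it should parse one replay file
        # into tensors of shape (T, state_dim) and (T, input_dim).
        self.files = sorted(Path(replay_dir).glob("*.replay"))
        self.seq_len = seq_len
        self.load_replay = load_replay

    def __iter__(self):
        # With num_workers > 0, shard the file list so each worker reads
        # a disjoint subset instead of duplicating every replay.
        info = torch.utils.data.get_worker_info()
        files = self.files if info is None else self.files[info.id::info.num_workers]

        for path in files:
            states, inputs = self.load_replay(path)
            # Slide a window over the replay and yield one subsequence at a time.
            for start in range(0, len(states) - self.seq_len + 1):
                end = start + self.seq_len
                yield states[start:end], inputs[start:end]


# loader = DataLoader(ReplayDataset("replays/", seq_len=64, load_replay=my_parser),
#                     batch_size=32, num_workers=2)
```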

Thank you!

This might be a better explanation of what I am trying to do.

I have a directory called text_files that has the following:

text1.txt
text2.txt
text3.txt

The content of each file is as follows:

my first text file
here is another text file that is a little longer
a third text file for extra measure

I need to iterate through each text file and return one additional word at a time, building up a prefix. Assume that each word maps to a token ID, and I'll use 0 as the padding token. I would like my DataLoader to load each record like so:

[my, 0, 0, 0, 0, 0, 0, 0, 0, 0, ]
[my, first, 0, 0, 0, 0, 0, 0, 0, 0, ]
[my, first, text, 0, 0, 0, 0, 0, 0, 0, ]
[my, first, text, file, 0, 0, 0, 0, 0, 0, ]
[here, 0, 0, 0, 0, 0, 0, 0, 0, 0, ]
[here, is, 0, 0, 0, 0, 0, 0, 0, 0, ]
[here, is, another, 0, 0, 0, 0, 0, 0, 0, ]

and so on. I'm going around in circles trying to figure out which combination of Dataset, DataLoader, and collate function I would need to accomplish this.
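
For example, I think what I'm after is something like a map-style Dataset with one entry per (file, prefix length) pair plus a collate function that pads with 0s, roughly like the sketch below. The tokenize callable and the file list are placeholders, and pad_sequence pads to the longest prefix in each batch rather than to a fixed length of 10 like the records above, but the idea is the same.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader


class PrefixDataset(Dataset):
    """Map-style dataset: one item per (file, prefix length) pair."""

    def __init__(self, file_paths, tokenize):
        # tokenize is a placeholder mapping a word to an integer id (0 reserved for padding).
        self.index = []  # flat list of (tokenized file, prefix length)
        for path in file_paths:
            with open(path) as f:
                words = f.read().split()
            ids = [tokenize(w) for w in words]
            # "my first text file" -> prefixes of length 1, 2, 3, 4
            self.index.extend((ids, n) for n in range(1, len(ids) + 1))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        ids, n = self.index[i]
        return torch.tensor(ids[:n], dtype=torch.long)


def collate(batch):
    # Pad every prefix in the batch with 0s up to the longest one.
    return pad_sequence(batch, batch_first=True, padding_value=0)


# loader = DataLoader(PrefixDataset(["text_files/text1.txt", ...], my_tokenizer),
#                     batch_size=4, shuffle=True, collate_fn=collate)
```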

After posting this, I think my use case is unique enough that I’ll just build my own class to handle everything.