Map-style dataset with unknown length


I have a relatively large dataset that is assembled from raw data at train time: the raw data is downloaded and preprocessed live, and the preprocessed tensors are "cached" on the hard drive for training, since they fit on disk but not in memory.
Due to data corruption or network issues, I expect a nonzero number of samples to never reach me.

Is it possible to make a map-style dataset robust against this problem? After the first epoch the length is well defined and I can set it, but during the first epoch I cannot.
Does that mean I am forced to preprocess all the data ahead of time?
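For context, here is a minimal sketch of the kind of workaround I am considering (all names here are hypothetical and `download_sample` just simulates the live fetch): report an upper-bound length in `__len__`, return `None` from `__getitem__` for samples that never arrive, and drop those entries in a custom `collate_fn`. The downside is that batches can come out smaller than `batch_size`, and the "length" is never the true length.

```python
import os
import tempfile

import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataloader import default_collate


def download_sample(idx):
    """Stand-in for the live download + preprocessing; some indices fail."""
    if idx % 5 == 0:  # simulate corrupted / unreachable samples
        raise IOError(f"sample {idx} could not be fetched")
    return torch.full((3,), float(idx))


class CachedDataset(Dataset):
    def __init__(self, expected_len, cache_dir):
        self.expected_len = expected_len  # upper bound, not the true length
        self.cache_dir = cache_dir

    def __len__(self):
        return self.expected_len

    def __getitem__(self, idx):
        path = os.path.join(self.cache_dir, f"{idx}.pt")
        if os.path.exists(path):
            return torch.load(path)  # hit the on-disk cache
        try:
            tensor = download_sample(idx)
        except IOError:
            return None  # mark the sample as missing
        torch.save(tensor, path)
        return tensor


def skip_none_collate(batch):
    """Drop missing samples before collating; batches may shrink."""
    batch = [b for b in batch if b is not None]
    return default_collate(batch) if batch else None


cache_dir = tempfile.mkdtemp()
loader = DataLoader(CachedDataset(10, cache_dir), batch_size=4,
                    collate_fn=skip_none_collate)
batches = [b for b in loader if b is not None]
```

With 10 indices and every fifth one failing, the loader above yields 8 samples in total, spread over batches of uneven size.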

If at all possible, I would like to avoid using an iterable-style dataset.