Exception handling when using Datapipes

Hi everyone!

I am trying to load files using the HTTPReaderIterDataPipe from pytorch/data.
How do I handle exceptions (e.g. timeouts) while iterating through the URLs?
I would like to skip the URL causing problems and just move on to the next one. This seems to be impossible when using the functional form.

Is there a DataPipe for exactly this purpose (catching exceptions and moving on), or is my understanding of exception handling with DataPipes incorrect?

Hi,

Thanks for using TorchData. While timeout is accepted as an argument, I don’t think there is a built-in way within HTTPReader to customize how exceptions are handled.

If you would like to skip the URL causing problems, you can consider using a .filter prior to HTTPReader to check if it is possible to establish a connection.

I don’t think that will fully address your issue, so other options are:

  1. Build on top of HTTPReader but override its exception handling (probably by rewriting __iter__)
  2. Write a new DataPipe that is able to catch exceptions coming from a source DataPipe
    • I think catching the exception is feasible; I’m less sure about resuming the DataPipe/Iterator after an exception is raised
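The difference between the two options can be sketched in plain Python, assuming the source behaves like a generator. All names here (failing_source, catch_from_source, fetch_each, fetch) are hypothetical stand-ins, not TorchData APIs. The sketch illustrates the resuming concern: a generator that has raised is finished, so an outer wrapper loses every item after the failure, whereas handling the exception per item skips only the failing one.

```python
def fetch(item):
    """Stand-in for a per-URL download that can time out."""
    if item == "bad":
        raise TimeoutError(f"timed out on {item}")
    return item.upper()


def failing_source(items):
    """Stand-in for a source pipe that raises partway through iteration."""
    for item in items:
        yield fetch(item)


def catch_from_source(source, exceptions=(Exception,)):
    """Option 2 style: wrap a source and swallow the exceptions it raises."""
    iterator = iter(source)
    while True:
        try:
            yield next(iterator)
        except StopIteration:
            return
        except exceptions:
            # The generator behind `iterator` is now exhausted: the next
            # call to next() raises StopIteration, so later items are lost.
            continue


def fetch_each(items, fetch_fn, exceptions=(Exception,)):
    """Option 1 style: guard each fetch so one failure skips one item."""
    for item in items:
        try:
            yield fetch_fn(item)
        except exceptions:
            continue


urls = ["a", "bad", "b"]
wrapped = list(catch_from_source(failing_source(urls)))  # stops after "A"
per_item = list(fetch_each(urls, fetch))                 # skips only "bad"
```

This suggests the try/except has to sit around the per-URL request itself, for example inside an overridden __iter__, rather than around the whole source iterator.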

We would accept a PR for 2 if you have a good implementation. Happy to discuss further.

cc: @ejguan

Thanks!
Option 1 is what I currently use. Wouldn’t it be better to just add this functionality (as an option) to the HTTPReader? I have opened an issue regarding this (Modify exception handling of online Datapipes · Issue #963 · pytorch/data · GitHub).
Since I do not think option 2 is that useful (for me at least) if continuing is impossible, I will not be working on it at the moment. If you think otherwise, please let me know. I would be happy to help.