Exception handling when using Datapipes

SvenDS9 · January 24, 2023, 11:03am

Hi everyone!

I am trying to load files using the HTTPReaderIterDataPipe from pytorch/data.
How do I handle exceptions (e.g. timeouts) while iterating through the URLs?
I would like to skip the URL causing problems and just move on to the next one. This seems to be impossible when using the functional form.

Is there a DataPipe built for exactly this purpose (catching exceptions and moving on), or is my understanding of how exception handling works with DataPipes incorrect?

nivek · January 24, 2023, 9:24pm

Hi,

Thanks for using TorchData. While a timeout is accepted as an argument, I don’t think there is a built-in way within HTTPReader to customize how exceptions are handled.

If you would like to skip the URL causing problems, you can consider using a .filter prior to HTTPReader to check if it is possible to establish a connection.
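A minimal sketch of what such a filter predicate could look like, using only the standard library. The name is_reachable and its opener parameter are hypothetical (not part of TorchData), and the commented usage assumes the IterableWrapper and HttpReader classes from torchdata.datapipes.iter:

```python
import http.client
import urllib.request


def is_reachable(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` can be opened within `timeout` seconds.

    `opener` is a hypothetical injection point (defaults to urlopen) so the
    check can be exercised without real network access.
    """
    try:
        with opener(url, timeout=timeout):
            return True
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers urllib.error.URLError, socket timeouts and refused
        # connections; ValueError covers malformed URLs.
        return False


# Assumed usage with TorchData (not executed here):
#   from torchdata.datapipes.iter import IterableWrapper, HttpReader
#   urls = IterableWrapper(["https://example.com/a.txt"]).filter(is_reachable)
#   reader = HttpReader(urls, timeout=5.0)
```

Note that this costs one extra request per URL, and a URL that passes the check can still fail or time out during the actual read.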

I don’t think that will fully address your issue, so the other options are:

1. Build on top of HTTPReader but override its exception handling (probably by rewriting __iter__)
2. Write a new DataPipe that can catch exceptions coming from a source DataPipe
   - I think catching the exception is feasible; I’m less sure about resuming the DataPipe/Iterator after an exception is raised
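As a rough illustration of the difference between the two options, here is a plain-Python sketch with no TorchData dependency. All names (fragile_reader, map_skipping_errors, fake_fetch) are made up for the example. The first part shows why resuming is the hard part of option 2: once a generator raises, it is finished. The second part shows the per-item try/except that option 1 amounts to, where a failure only affects a single URL.

```python
def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request; fails for one URL."""
    if url == "bad":
        raise OSError("simulated timeout")
    return url.upper()


def fragile_reader(urls):
    """Source iterator that lets exceptions escape (like an unmodified reader)."""
    for url in urls:
        yield url, fake_fetch(url)


# Option 2: catching outside the source iterator.
it = fragile_reader(["a", "bad", "b"])
first = next(it)          # ("a", "A")
try:
    next(it)              # raises OSError for "bad"
except OSError:
    pass
# The generator is exhausted after raising, so "b" is never produced.
remaining = list(it)


def map_skipping_errors(urls, fn, exceptions=(OSError,)):
    """Option 1: handle the error per item, inside the iteration loop."""
    for url in urls:
        try:
            result = fn(url)
        except exceptions:
            continue      # skip the failing URL and move on
        yield url, result


loaded = list(map_skipping_errors(["a", "bad", "b"], fake_fetch))
```

Here remaining ends up empty while loaded still contains the entries for "a" and "b", which is why the try/except has to live where each URL is processed.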

We would accept a PR for option 2 if you have a good implementation. Happy to discuss further.

cc: @ejguan

SvenDS9 · January 25, 2023, 2:35pm

Thanks!
Option 1 is what I currently use. Wouldn’t it be better to just add this functionality (as an option) to the HTTPReader? I have opened an issue regarding this (Modify exception handling of online Datapipes · Issue #963 · pytorch/data).
Personally, I don’t think option 2 is that useful (for me at least) if continuing is impossible, so I will not be working on it at the moment. If you think otherwise, please let me know. I would be happy to help.