Torchdata 0.5.0 seems a lot slower and uses more memory than 0.4.1

We are heavily using torchdata and datapipes at Open Climate Fix in our openclimatefix/ocf_datapipes repo (OCF's DataPipe-based dataloader for training and inference). We have it working well with torchdata 0.4.1, but after upgrading to the newer PyTorch and torchdata 0.5.0, our CI tests time out and fail because of memory issues. When running the same tests locally with torchdata 0.4.1, 107 tests pass in 157 seconds, while with torchdata 0.5.0 the suite is still running 10 minutes later with only 11 tests passed so far. On GitHub Actions, the tests time out and run out of memory after updating to the newer version.

Looking through the release notes, we cannot see any obvious reason why this is happening. We are not using any of the removed features, for example. Are there any changes between 0.4.1 and 0.5.0 that are not in the release notes and might have caused this?

There are lots of differences in library versions in Update Requirements by jacobbieker · Pull Request #86 · openclimatefix/ocf_datapipes · GitHub, and I see that some libraries like rioxarray, xarray and fsspec have been downgraded between the main branch at Merge branch 'issue/datamodule' · openclimatefix/ocf_datapipes@807ed6a · GitHub and the PR at [pre-commit.ci] auto fixes from pre-commit.com hooks · openclimatefix/ocf_datapipes@f454c8b · GitHub. Can you run a pip list for each environment (with torchdata 0.4.1 and 0.5.0) so that we know what the actual differences in library versions are and can better isolate the cause of the timeouts?
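If it helps, here is a minimal sketch of one way to compare the two listings once they are saved as text. The function names `parse_pip_list` and `diff_environments` are just made up for this illustration, and the parsing assumes the default column layout that `pip list` prints.

```python
def parse_pip_list(text):
    """Parse `pip list` output into a {package: version} dict."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row and the dashed separator row.
        if len(parts) < 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        versions[parts[0]] = parts[1]
    return versions


def diff_environments(old_text, new_text):
    """Return {package: (old_version, new_version)} for packages that differ.

    A version of None means the package is absent from that environment.
    """
    old, new = parse_pip_list(old_text), parse_pip_list(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }
```

Feeding it the two `pip list` outputs as strings should leave only the packages whose versions differ, which is usually a much shorter list to reason about.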

Also, it seems like there are rioxarray.open_rasterio calls (e.g. at ocf_datapipes/topographic.py at eb08124fdfb4deee23438984562dd7d1dd605bd5 · openclimatefix/ocf_datapipes · GitHub) that are not made in a context manager, which might make them prone to memory leaks. There was a recent fix at Release 0.13.1 · corteva/rioxarray · GitHub, but I’m not sure if it’s related to your issue here. Oh, and in addition to that, you might want to consider using StreamWrapper (see the TorchData 0.5.0 (beta) documentation) for some of your datapipes to close the files properly. See e.g. zen3geo/rioxarray.py at a71639f9476dcf423dc806f4270f04938aa62739 · weiji14/zen3geo · GitHub for an example.

Hello,

Thanks for all the detail! I’ll start making some of those changes, and I’ve updated that branch to match main now. At the moment, for torchdata 0.4.1 the pip list is:

Package                 Version       Editable project location
----------------------- ------------- -------------------------------------
affine                  2.3.1
aiohttp                 3.8.3
aiosignal               1.3.1
alembic                 1.8.1
asciitree               0.3.3
async-timeout           4.0.2
attrs                   22.1.0
bokeh                   2.4.3
Bottleneck              1.3.5
branca                  0.6.0
brotlipy                0.7.0
cached-property         1.5.2
Cartopy                 0.21.0
certifi                 2022.9.24
cffi                    1.15.1
charset-normalizer      2.1.1
click                   8.1.3
click-plugins           1.1.1
cligj                   0.7.2
cloudpickle             2.2.0
configobj               5.0.6
contourpy               1.0.6
cramjam                 2.6.2
cryptography            38.0.3
cycler                  0.11.0
cytoolz                 0.12.0
dask                    2022.11.1
distributed             2022.11.1
einops                  0.6.0
entrypoints             0.4
exceptiongroup          1.0.4
fasteners               0.18
fastparquet             2022.11.0
Fiona                   1.8.22
fire                    0.4.0
folium                  0.13.0
fonttools               4.38.0
freezegun               1.2.2
frozenlist              1.3.3
fsspec                  2022.11.0
GDAL                    3.5.3
geopandas               0.12.1
gitdb                   4.0.10
GitPython               3.1.29
greenlet                2.0.1
h5netcdf                0.0.0
h5py                    3.7.0
HeapDict                1.0.1
idna                    3.4
imagecodecs             2022.9.26
iniconfig               1.1.1
Jinja2                  3.1.2
joblib                  1.2.0
jpeg-xl-float-with-nans 0.0.4
kiwisolver              1.4.4
lightning-utilities     0.3.0
locket                  1.0.0
lz4                     4.0.2
Mako                    1.2.4
mapclassify             2.4.3
MarkupSafe              2.1.1
matplotlib              3.6.2
msgpack                 1.0.4
multidict               6.0.2
munch                   2.5.0
munkres                 1.1.4
networkx                2.8.8
nowcasting-datamodel    1.1.54
numcodecs               0.10.2
numpy                   1.23.5
ocf-datapipes           0.5.30        /home/jacob/Development/ocf_datapipes
packaging               21.3
pandas                  1.5.2
partd                   1.3.0
pathy                   0.10.0
Pillow                  9.2.0
pip                     22.3.1
pluggy                  1.0.0
portalocker             2.6.0
protobuf                3.20.1
psutil                  5.9.4
psycopg2-binary         2.9.5
pvlib                   0.9.3
pvlive-api              0.11
pyaml-env               1.2.0
pycparser               2.21
pydantic                1.10.2
pykdtree                1.3.6
pyOpenSSL               22.1.0
pyparsing               3.0.9
pyproj                  3.4.0
pyresample              1.25.1
pyshp                   2.3.1
PySocks                 1.7.1
pytest                  7.2.0
python-dateutil         2.8.2
pytorch-lightning       1.8.3.post0
pytz                    2022.6
PyYAML                  5.4.1
rasterio                1.3.3
requests                2.28.1
rioxarray               0.13.1
Rtree                   1.0.1
scikit-learn            1.1.3
scipy                   1.9.3
setuptools              65.5.1
Shapely                 1.8.5.post1
six                     1.16.0
smart-open              5.2.1
smmap                   5.0.0
snuggs                  1.4.7
sortedcontainers        2.4.0
SQLAlchemy              1.4.44
tblib                   1.7.0
tensorboardX            2.5.1
termcolor               2.1.1
threadpoolctl           3.1.0
tomli                   2.0.1
toolz                   0.12.0
torch                   1.12.0
torchdata               0.4.1+f9ecd8b
torchmetrics            0.10.3
torchvision             0.13.0
tornado                 6.1
tqdm                    4.64.1
typer                   0.7.0
typing_extensions       4.4.0
unicodedata2            15.0.0
urllib3                 1.26.13
wheel                   0.38.4
xarray                  2022.11.0
xyzservices             2022.9.0
yarl                    1.8.1
zarr                    2.13.3
zict                    2.2.0

And for torchdata 0.5.0 it is:

Package                 Version     Editable project location
----------------------- ----------- -------------------------------------
affine                  2.3.1
aiohttp                 3.8.3
aiosignal               1.3.1
alembic                 1.8.1
asciitree               0.3.3
async-timeout           4.0.2
attrs                   22.1.0
bokeh                   2.4.3
Bottleneck              1.3.5
branca                  0.6.0
brotlipy                0.7.0
cached-property         1.5.2
Cartopy                 0.21.0
certifi                 2022.9.24
cffi                    1.15.1
charset-normalizer      2.1.1
click                   8.1.3
click-plugins           1.1.1
cligj                   0.7.2
cloudpickle             2.2.0
configobj               5.0.6
contourpy               1.0.6
cramjam                 2.6.2
cryptography            38.0.3
cycler                  0.11.0
cytoolz                 0.12.0
dask                    2022.11.1
distributed             2022.11.1
einops                  0.6.0
entrypoints             0.4
exceptiongroup          1.0.4
execnet                 1.9.0
fasteners               0.18
fastparquet             2022.11.0
Fiona                   1.8.22
fire                    0.4.0
folium                  0.13.0
fonttools               4.38.0
freezegun               1.2.2
frozenlist              1.3.3
fsspec                  2022.11.0
GDAL                    3.5.3
geopandas               0.12.1
gitdb                   4.0.9
GitPython               3.1.29
greenlet                2.0.1
h5netcdf                0.0.0
h5py                    3.7.0
HeapDict                1.0.1
idna                    3.4
imagecodecs             2022.9.26
iniconfig               1.1.1
Jinja2                  3.1.2
joblib                  1.2.0
jpeg-xl-float-with-nans 0.0.4
kiwisolver              1.4.4
lightning-utilities     0.3.0
locket                  1.0.0
lz4                     4.0.2
Mako                    1.2.4
mapclassify             2.4.3
MarkupSafe              2.1.1
matplotlib              3.6.2
msgpack                 1.0.4
multidict               6.0.2
munch                   2.5.0
munkres                 1.1.4
networkx                2.8.8
nowcasting-datamodel    1.1.54
numcodecs               0.10.2
numpy                   1.23.5
ocf-datapipes           0.5.30      /home/jacob/Development/ocf_datapipes
packaging               21.3
pandas                  1.5.2
partd                   1.3.0
pathy                   0.9.0
Pillow                  9.2.0
pip                     22.3.1
pluggy                  1.0.0
portalocker             2.6.0
protobuf                3.20.1
psutil                  5.9.4
psycopg2-binary         2.9.5
pvlib                   0.9.3
pvlive-api              0.11
pyaml-env               1.2.0
pycparser               2.21
pydantic                1.10.2
pykdtree                1.3.6
pyOpenSSL               22.1.0
pyparsing               3.0.9
pyproj                  3.4.0
pyresample              1.25.1
pyshp                   2.3.1
PySocks                 1.7.1
pytest                  7.2.0
pytest-timeout          2.1.0
pytest-xdist            3.0.2
python-dateutil         2.8.2
pytorch-lightning       1.8.3.post0
pytz                    2022.6
PyYAML                  5.4.1
rasterio                1.3.3
requests                2.28.1
rioxarray               0.13.1
Rtree                   1.0.1
scikit-learn            1.1.3
scipy                   1.9.3
setuptools              65.5.1
Shapely                 1.8.5.post1
six                     1.16.0
smart-open              5.2.1
smmap                   5.0.0
snuggs                  1.4.7
sortedcontainers        2.4.0
SQLAlchemy              1.4.44
tblib                   1.7.0
tensorboardX            2.5.1
termcolor               2.1.1
threadpoolctl           3.1.0
tomli                   2.0.1
toolz                   0.12.0
torch                   1.13.0
torchdata               0.5.0
torchmetrics            0.10.3
torchvision             0.14.0
tornado                 6.1
tqdm                    4.64.1
typer                   0.7.0
typing_extensions       4.4.0
unicodedata2            15.0.0
urllib3                 1.26.12
wheel                   0.38.4
xarray                  2022.11.0
xyzservices             2022.9.0
yarl                    1.8.1
zarr                    2.13.3
zict                    2.2.0

CC @nivek @ejguan PTAL

Ok, I see you’ve made some force pushes… The main diffs seem to be on torch, torchdata and torchvision now:

diff --git a/requirements.txt b/requirements.txt
index 1269b7be..09bebd84 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -31,6 +31,7 @@ distributed             2022.11.1
 einops                  0.6.0
 entrypoints             0.4
 exceptiongroup          1.0.4
+execnet                 1.9.0
 fasteners               0.18
 fastparquet             2022.11.0
 Fiona                   1.8.22
@@ -42,7 +43,7 @@ frozenlist              1.3.3
 fsspec                  2022.11.0
 GDAL                    3.5.3
 geopandas               0.12.1
-gitdb                   4.0.10
+gitdb                   4.0.9
 GitPython               3.1.29
 greenlet                2.0.1
 h5netcdf                0.0.0
@@ -74,7 +75,7 @@ ocf-datapipes           0.5.30        /home/jacob/Development/ocf_datapipes
 packaging               21.3
 pandas                  1.5.2
 partd                   1.3.0
-pathy                   0.10.0
+pathy                   0.9.0
 Pillow                  9.2.0
 pip                     22.3.1
 pluggy                  1.0.0
@@ -95,6 +96,8 @@ pyresample              1.25.1
 pyshp                   2.3.1
 PySocks                 1.7.1
 pytest                  7.2.0
+pytest-timeout          2.1.0
+pytest-xdist            3.0.2
 python-dateutil         2.8.2
 pytorch-lightning       1.8.3.post0
 pytz                    2022.6
@@ -119,16 +122,16 @@ termcolor               2.1.1
 threadpoolctl           3.1.0
 tomli                   2.0.1
 toolz                   0.12.0
-torch                   1.12.0
-torchdata               0.4.1+f9ecd8b
+torch                   1.13.0
+torchdata               0.5.0
 torchmetrics            0.10.3
-torchvision             0.13.0
+torchvision             0.14.0
 tornado                 6.1
 tqdm                    4.64.1
 typer                   0.7.0
 typing_extensions       4.4.0
 unicodedata2            15.0.0
-urllib3                 1.26.13
+urllib3                 1.26.12
 wheel                   0.38.4
 xarray                  2022.11.0
 xyzservices             2022.9.0

Can you disable pytest-xdist temporarily at ocf_datapipes/workflows.yaml at 86f2a1397a6696c536b692d64cff16dc25e8afe9 · openclimatefix/ocf_datapipes · GitHub (i.e. run the tests in serial) to see if it’s possible to isolate which test it’s crashing on? Also, I think the yield statement in your __iter__ function is still outside of the with rioxarray.open_rasterio(...) block, though I’m not 100% sure that’s the main issue anymore. It might actually be something in pytorch/torchdata itself.
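To make the yield placement concrete, here is a minimal pure-Python sketch. It does not use rioxarray or torchdata: `open_raster`, `RasterPipe` and `open_handles` are hypothetical stand-ins invented for this example, so it only illustrates the pattern and is not your actual datapipe.

```python
from contextlib import contextmanager

open_handles = []  # tracks which stand-in files are currently open


@contextmanager
def open_raster(path):
    """Hypothetical stand-in for a file opener used as a context manager."""
    open_handles.append(path)
    try:
        yield {"path": path, "data": [1, 2, 3]}
    finally:
        open_handles.remove(path)  # runs when the with-block exits


class RasterPipe:
    """Minimal iterable mimicking the __iter__ of an iterable-style datapipe."""

    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            with open_raster(path) as raster:
                # Yield inside the with-block: the file stays open while the
                # consumer uses it and is closed before the next one opens.
                yield raster


handles_seen = []
for raster in RasterPipe(["a.tif", "b.tif"]):
    handles_seen.append(list(open_handles))
```

With the yield inside the with-block, only one stand-in file is open at any point during iteration, and nothing is left open once the loop finishes.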

As an aside, have you considered using something like xbatcher to read smaller slices of data at a time? Depends on what your data looks like of course :slightly_smiling_face:

Yeah, I force pushed to rebase the branch on the newest version so there are fewer changes between it and the main branch. And yes, the yield is still outside the with block, but I can move it inside. I have heard of xbatcher and will look into it. Our data is a mix of different xarray datasets with different spatial coordinate systems, loaded all at once; we have to compute the temporal and spatial overlaps and then select batches of data based on those intersections, which seems a bit more complicated than what xbatcher is designed for.

I think I found the issue in Dataloader does not stop if one of the zipped DataPipes has a perpetual cycle · Issue #865 · pytorch/data · GitHub
Essentially, the Zipper datapipe has changed behaviour a bit.

Right, so unclosed file streams are messing up at the ZipperIterDataPipe in torchdata=0.5.0 according to Dataloader does not stop if one of the zipped DataPipes has a perpetual cycle · Issue #865 · pytorch/data · GitHub. So either wait for a bugfix, pin to torchdata=0.4.1, or handle yield statements properly :slightly_smiling_face:
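For anyone following along, here is a rough pure-Python sketch of why a zip over a perpetual cycle can hang. This is not torchdata's code: `zip_shortest` and `zip_then_drain` are hypothetical names, and the draining variant is only my assumption about the kind of behaviour that could cause this, with a `max_drain` guard added so the example always terminates.

```python
from itertools import cycle


def zip_shortest(*iterables):
    """Stop as soon as any input is exhausted (like the built-in zip)."""
    iterators = [iter(it) for it in iterables]
    while True:
        row = []
        for it in iterators:
            try:
                row.append(next(it))
            except StopIteration:
                return
        yield tuple(row)


def zip_then_drain(*iterables, max_drain=1000):
    """Illustrative variant: after the shortest input ends, exhaust the rest.

    max_drain is a safety guard added for this sketch so it always terminates.
    """
    iterators = [iter(it) for it in iterables]
    while True:
        row = []
        for it in iterators:
            try:
                row.append(next(it))
            except StopIteration:
                # Drain every input; with an infinite input this loop
                # would never end without the guard.
                for other in iterators:
                    drained = 0
                    for _ in other:
                        drained += 1
                        if drained >= max_drain:
                            raise RuntimeError("input never ended while draining")
                return
        yield tuple(row)
```

Calling `list(zip_shortest([1, 2, 3], cycle("ab")))` finishes after three rows, whereas `zip_then_drain` with the same inputs only stops because the guard raises.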

Hmm, xbatcher.BatchGenerator does have an input_overlap parameter, but I’m guessing you’re after something a bit more advanced. There is a possibly related issue open at Handling of overlapping samples and shuffling · Issue #30 · xarray-contrib/xbatcher · GitHub, and you’re welcome to articulate your use-case there or open another issue. Just want to save you from having to reinvent the wheel.

Thank you guys for keeping us posted about the bug you found. I will send a patch ASAP and let you know when the new nightly release becomes available for you to test.

With [4/4][DataPipe] Remove iterator depletion in Zipper by ejguan · Pull Request #89974 · pytorch/pytorch · GitHub landed, this issue should be fixed. I have run the newly added tests locally.

You should be able to find the nightly releases for both torch and torchdata tomorrow. For how to install nightly releases, please check the official guidance:

pip3 install --pre torch torchdata  --extra-index-url https://download.pytorch.org/whl/nightly/cpu

You can replace cpu with cu116, cu117 or rocm5.2 to install the torch build that matches your setup.