We are heavily using torchdata and datapipes at Open Climate Fix in our ocf_datapipes repo (GitHub - openclimatefix/ocf_datapipes: OCF's DataPipe based dataloader for training and inference). We have it working well with torchdata 0.4.1, but after upgrading to the newer PyTorch and torchdata 0.5.0, our CI tests time out and fail because of memory issues. When running the same tests locally with torchdata 0.4.1, 107 tests pass in 157 seconds, while with torchdata 0.5.0 the suite is still running 10 minutes later with only 11 tests passed so far. On GitHub Actions, the tests time out and run out of memory with the newer version.
Looking through the release notes, we cannot see any obvious reason why this is happening. We are not using any of the removed features, for example. Are there any changes between 0.4.1 and 0.5.0 that are not in the release notes and might have caused this?
weiji14
November 23, 2022, 9:21pm
2
There are lots of differences in library versions in Update Requirements by jacobbieker · Pull Request #86 · openclimatefix/ocf_datapipes · GitHub, and I see that some libraries like rioxarray, xarray and fsspec have been downgraded between the main branch at Merge branch 'issue/datamodule' · openclimatefix/ocf_datapipes@807ed6a · GitHub and the PR at [pre-commit.ci] auto fixes from pre-commit.com hooks · openclimatefix/ocf_datapipes@f454c8b · GitHub. Can you run pip list for each environment (with torchdata 0.4.1 and 0.5.0) so that we know what the actual differences in library versions are and can better isolate the cause of the timeouts?
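A quick way to compare two pip list outputs is to parse each into a package-to-version mapping and report only the entries that differ. The sketch below is a minimal illustration using only the standard library; the helper names parse_pip_list and diff_versions are made up for this example, and the two short tables are abbreviated sample input rather than full environment dumps.

```python
def parse_pip_list(text):
    """Parse the table printed by `pip list` into a {package: version} dict."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the "Package Version" header and the dashed rule.
        if len(parts) < 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        # Package names are case-insensitive, so normalise them.
        versions[parts[0].lower()] = parts[1]
    return versions


def diff_versions(old, new):
    """Return {package: (old_version, new_version)} for every package that differs."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changed[name] = (before, after)
    return changed


# Abbreviated sample input standing in for the two real `pip list` outputs.
old_env = """\
Package   Version
--------- -------------
torch     1.12.0
torchdata 0.4.1+f9ecd8b
xarray    2022.11.0
"""
new_env = """\
Package   Version
--------- ---------
execnet   1.9.0
torch     1.13.0
torchdata 0.5.0
xarray    2022.11.0
"""

changes = diff_versions(parse_pip_list(old_env), parse_pip_list(new_env))
for name, (before, after) in changes.items():
    # None means the package is missing from that environment.
    print(f"{name}: {before} -> {after}")
```

In practice the two strings would come from files saved with pip list > old.txt in each environment; packages present in only one environment show up with None on the other side.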
Also, it seems like there are rioxarray.open_rasterio calls (e.g. at ocf_datapipes/topographic.py at eb08124fdfb4deee23438984562dd7d1dd605bd5 · openclimatefix/ocf_datapipes · GitHub) that are not made in a context manager, which might make them prone to memory leaks. There was a recent fix at Release Release 0.13.1 · corteva/rioxarray · GitHub, but I'm not sure if it’s related to your issue here. Oh, and in addition to that, you might want to consider using StreamWrapper (see the TorchData 0.5.0 (beta) documentation) for some of your datapipes to close the files properly. See e.g. zen3geo/rioxarray.py at a71639f9476dcf423dc806f4270f04938aa62739 · weiji14/zen3geo · GitHub for an example.
Hello,
Thanks for all the detail! I’ll start making some of those changes, and I’ve updated that branch to match main now. At the moment, the pip list for torchdata 0.4.1 is:
Package Version Editable project location
----------------------- ------------- -------------------------------------
affine 2.3.1
aiohttp 3.8.3
aiosignal 1.3.1
alembic 1.8.1
asciitree 0.3.3
async-timeout 4.0.2
attrs 22.1.0
bokeh 2.4.3
Bottleneck 1.3.5
branca 0.6.0
brotlipy 0.7.0
cached-property 1.5.2
Cartopy 0.21.0
certifi 2022.9.24
cffi 1.15.1
charset-normalizer 2.1.1
click 8.1.3
click-plugins 1.1.1
cligj 0.7.2
cloudpickle 2.2.0
configobj 5.0.6
contourpy 1.0.6
cramjam 2.6.2
cryptography 38.0.3
cycler 0.11.0
cytoolz 0.12.0
dask 2022.11.1
distributed 2022.11.1
einops 0.6.0
entrypoints 0.4
exceptiongroup 1.0.4
fasteners 0.18
fastparquet 2022.11.0
Fiona 1.8.22
fire 0.4.0
folium 0.13.0
fonttools 4.38.0
freezegun 1.2.2
frozenlist 1.3.3
fsspec 2022.11.0
GDAL 3.5.3
geopandas 0.12.1
gitdb 4.0.10
GitPython 3.1.29
greenlet 2.0.1
h5netcdf 0.0.0
h5py 3.7.0
HeapDict 1.0.1
idna 3.4
imagecodecs 2022.9.26
iniconfig 1.1.1
Jinja2 3.1.2
joblib 1.2.0
jpeg-xl-float-with-nans 0.0.4
kiwisolver 1.4.4
lightning-utilities 0.3.0
locket 1.0.0
lz4 4.0.2
Mako 1.2.4
mapclassify 2.4.3
MarkupSafe 2.1.1
matplotlib 3.6.2
msgpack 1.0.4
multidict 6.0.2
munch 2.5.0
munkres 1.1.4
networkx 2.8.8
nowcasting-datamodel 1.1.54
numcodecs 0.10.2
numpy 1.23.5
ocf-datapipes 0.5.30 /home/jacob/Development/ocf_datapipes
packaging 21.3
pandas 1.5.2
partd 1.3.0
pathy 0.10.0
Pillow 9.2.0
pip 22.3.1
pluggy 1.0.0
portalocker 2.6.0
protobuf 3.20.1
psutil 5.9.4
psycopg2-binary 2.9.5
pvlib 0.9.3
pvlive-api 0.11
pyaml-env 1.2.0
pycparser 2.21
pydantic 1.10.2
pykdtree 1.3.6
pyOpenSSL 22.1.0
pyparsing 3.0.9
pyproj 3.4.0
pyresample 1.25.1
pyshp 2.3.1
PySocks 1.7.1
pytest 7.2.0
python-dateutil 2.8.2
pytorch-lightning 1.8.3.post0
pytz 2022.6
PyYAML 5.4.1
rasterio 1.3.3
requests 2.28.1
rioxarray 0.13.1
Rtree 1.0.1
scikit-learn 1.1.3
scipy 1.9.3
setuptools 65.5.1
Shapely 1.8.5.post1
six 1.16.0
smart-open 5.2.1
smmap 5.0.0
snuggs 1.4.7
sortedcontainers 2.4.0
SQLAlchemy 1.4.44
tblib 1.7.0
tensorboardX 2.5.1
termcolor 2.1.1
threadpoolctl 3.1.0
tomli 2.0.1
toolz 0.12.0
torch 1.12.0
torchdata 0.4.1+f9ecd8b
torchmetrics 0.10.3
torchvision 0.13.0
tornado 6.1
tqdm 4.64.1
typer 0.7.0
typing_extensions 4.4.0
unicodedata2 15.0.0
urllib3 1.26.13
wheel 0.38.4
xarray 2022.11.0
xyzservices 2022.9.0
yarl 1.8.1
zarr 2.13.3
zict 2.2.0
And for torchdata 0.5.0 it is:
Package Version Editable project location
----------------------- ----------- -------------------------------------
affine 2.3.1
aiohttp 3.8.3
aiosignal 1.3.1
alembic 1.8.1
asciitree 0.3.3
async-timeout 4.0.2
attrs 22.1.0
bokeh 2.4.3
Bottleneck 1.3.5
branca 0.6.0
brotlipy 0.7.0
cached-property 1.5.2
Cartopy 0.21.0
certifi 2022.9.24
cffi 1.15.1
charset-normalizer 2.1.1
click 8.1.3
click-plugins 1.1.1
cligj 0.7.2
cloudpickle 2.2.0
configobj 5.0.6
contourpy 1.0.6
cramjam 2.6.2
cryptography 38.0.3
cycler 0.11.0
cytoolz 0.12.0
dask 2022.11.1
distributed 2022.11.1
einops 0.6.0
entrypoints 0.4
exceptiongroup 1.0.4
execnet 1.9.0
fasteners 0.18
fastparquet 2022.11.0
Fiona 1.8.22
fire 0.4.0
folium 0.13.0
fonttools 4.38.0
freezegun 1.2.2
frozenlist 1.3.3
fsspec 2022.11.0
GDAL 3.5.3
geopandas 0.12.1
gitdb 4.0.9
GitPython 3.1.29
greenlet 2.0.1
h5netcdf 0.0.0
h5py 3.7.0
HeapDict 1.0.1
idna 3.4
imagecodecs 2022.9.26
iniconfig 1.1.1
Jinja2 3.1.2
joblib 1.2.0
jpeg-xl-float-with-nans 0.0.4
kiwisolver 1.4.4
lightning-utilities 0.3.0
locket 1.0.0
lz4 4.0.2
Mako 1.2.4
mapclassify 2.4.3
MarkupSafe 2.1.1
matplotlib 3.6.2
msgpack 1.0.4
multidict 6.0.2
munch 2.5.0
munkres 1.1.4
networkx 2.8.8
nowcasting-datamodel 1.1.54
numcodecs 0.10.2
numpy 1.23.5
ocf-datapipes 0.5.30 /home/jacob/Development/ocf_datapipes
packaging 21.3
pandas 1.5.2
partd 1.3.0
pathy 0.9.0
Pillow 9.2.0
pip 22.3.1
pluggy 1.0.0
portalocker 2.6.0
protobuf 3.20.1
psutil 5.9.4
psycopg2-binary 2.9.5
pvlib 0.9.3
pvlive-api 0.11
pyaml-env 1.2.0
pycparser 2.21
pydantic 1.10.2
pykdtree 1.3.6
pyOpenSSL 22.1.0
pyparsing 3.0.9
pyproj 3.4.0
pyresample 1.25.1
pyshp 2.3.1
PySocks 1.7.1
pytest 7.2.0
pytest-timeout 2.1.0
pytest-xdist 3.0.2
python-dateutil 2.8.2
pytorch-lightning 1.8.3.post0
pytz 2022.6
PyYAML 5.4.1
rasterio 1.3.3
requests 2.28.1
rioxarray 0.13.1
Rtree 1.0.1
scikit-learn 1.1.3
scipy 1.9.3
setuptools 65.5.1
Shapely 1.8.5.post1
six 1.16.0
smart-open 5.2.1
smmap 5.0.0
snuggs 1.4.7
sortedcontainers 2.4.0
SQLAlchemy 1.4.44
tblib 1.7.0
tensorboardX 2.5.1
termcolor 2.1.1
threadpoolctl 3.1.0
tomli 2.0.1
toolz 0.12.0
torch 1.13.0
torchdata 0.5.0
torchmetrics 0.10.3
torchvision 0.14.0
tornado 6.1
tqdm 4.64.1
typer 0.7.0
typing_extensions 4.4.0
unicodedata2 15.0.0
urllib3 1.26.12
wheel 0.38.4
xarray 2022.11.0
xyzservices 2022.9.0
yarl 1.8.1
zarr 2.13.3
zict 2.2.0
weiji14
November 25, 2022, 1:40am
5
Ok, I see you’ve made some force pushes… The main diffs now seem to be in torch, torchdata and torchvision:
diff --git a/requirements.txt b/requirements.txt
index 1269b7be..09bebd84 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -31,6 +31,7 @@ distributed 2022.11.1
einops 0.6.0
entrypoints 0.4
exceptiongroup 1.0.4
+execnet 1.9.0
fasteners 0.18
fastparquet 2022.11.0
Fiona 1.8.22
@@ -42,7 +43,7 @@ frozenlist 1.3.3
fsspec 2022.11.0
GDAL 3.5.3
geopandas 0.12.1
-gitdb 4.0.10
+gitdb 4.0.9
GitPython 3.1.29
greenlet 2.0.1
h5netcdf 0.0.0
@@ -74,7 +75,7 @@ ocf-datapipes 0.5.30 /home/jacob/Development/ocf_datapipes
packaging 21.3
pandas 1.5.2
partd 1.3.0
-pathy 0.10.0
+pathy 0.9.0
Pillow 9.2.0
pip 22.3.1
pluggy 1.0.0
@@ -95,6 +96,8 @@ pyresample 1.25.1
pyshp 2.3.1
PySocks 1.7.1
pytest 7.2.0
+pytest-timeout 2.1.0
+pytest-xdist 3.0.2
python-dateutil 2.8.2
pytorch-lightning 1.8.3.post0
pytz 2022.6
@@ -119,16 +122,16 @@ termcolor 2.1.1
threadpoolctl 3.1.0
tomli 2.0.1
toolz 0.12.0
-torch 1.12.0
-torchdata 0.4.1+f9ecd8b
+torch 1.13.0
+torchdata 0.5.0
torchmetrics 0.10.3
-torchvision 0.13.0
+torchvision 0.14.0
tornado 6.1
tqdm 4.64.1
typer 0.7.0
typing_extensions 4.4.0
unicodedata2 15.0.0
-urllib3 1.26.13
+urllib3 1.26.12
wheel 0.38.4
xarray 2022.11.0
xyzservices 2022.9.0
Can you disable pytest-xdist temporarily at ocf_datapipes/workflows.yaml at 86f2a1397a6696c536b692d64cff16dc25e8afe9 · openclimatefix/ocf_datapipes · GitHub (i.e. run the tests in serial) to see if it’s possible to isolate which test it’s crashing on? Also, I think your yield statement in the __iter__ function is still outside of the with rioxarray.open_rasterio(...) block, though I’m not 100% sure if that’s the main issue anymore. It might actually be something in pytorch/torchdata itself.
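To make the point about yield placement concrete, here is a minimal sketch of the two patterns. It does not use rioxarray at all: FakeRaster is a hypothetical stand-in for whatever rioxarray.open_rasterio(...) returns, and the generator names are invented for the example. It only illustrates how Python generators interact with a with block.

```python
class FakeRaster:
    """Hypothetical stand-in for an opened raster; only tracks open/closed state."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.closed = True
        return False  # do not swallow exceptions


def iter_yield_outside(names):
    """Yield after the with block: the consumer gets an already-closed object."""
    for name in names:
        with FakeRaster(name) as raster:
            pass  # the file is only open inside this block
        yield raster


def iter_yield_inside(names):
    """Yield inside the with block: the file stays open while the consumer uses it."""
    for name in names:
        with FakeRaster(name) as raster:
            yield raster  # __exit__ runs when the generator resumes or is closed


# State of each raster at the moment the consumer sees it.
outside_states = [r.closed for r in iter_yield_outside(["a.tif", "b.tif"])]

seen, inside_states = [], []
for r in iter_yield_inside(["a.tif", "b.tif"]):
    seen.append(r)
    inside_states.append(r.closed)

# Stopping early: closing the generator still runs __exit__ on the open raster.
gen = iter_yield_inside(["c.tif"])
partial = next(gen)
gen.close()

print(outside_states, inside_states, [r.closed for r in seen], partial.closed)
```

The design point is that with the yield inside the block, cleanup is tied to the generator's lifetime, so a consumer that abandons the iterator without closing it can leave the file open until the generator is garbage collected.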
As an aside, have you considered using something like xbatcher to read smaller slices of data at a time? It depends on what your data looks like, of course.
jacobbieker
(Jacob Bieker)
November 25, 2022, 12:35pm
6
Yeah, I force pushed to rebase the branch on the newest version so there are fewer changes between it and the main branch. And yeah, the yield is still outside the with block; I can move it inside though. I have heard of xbatcher and will look into it. Our data is a mix of different xarray datasets with different spatial coordinate systems loaded all at once, and we have to compute the temporal and spatial overlaps and then select batches of data based on those intersections, which seems like it might be a bit more complicated than what xbatcher is designed for.
weiji14
November 26, 2022, 2:48am
8
Right, so unclosed file streams are tripping up the ZipperIterDataPipe in torchdata=0.5.0, according to Dataloader does not stop if one of the zipped DataPipes has a perpetual cycle · Issue #865 · pytorch/data · GitHub. So either wait for a bugfix, pin to torchdata=0.4.1, or handle yield statements properly.
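As a rough illustration of why zipping a finite input with a never-ending one can hang, here is a plain-Python sketch. It is not torchdata's implementation; zip_shortest, zip_then_deplete and the max_drain cap are hypothetical names made up for this example, based only on the issue title above. The first variant stops as soon as any input is exhausted, while the second then tries to drain whatever is left in the other inputs, which cannot finish if one of them cycles forever.

```python
import itertools


def zip_shortest(*iterables):
    """Stop as soon as any input is exhausted, leaving the others untouched."""
    iterators = [iter(it) for it in iterables]
    while True:
        row = []
        for it in iterators:
            try:
                row.append(next(it))
            except StopIteration:
                return
        yield tuple(row)


def zip_then_deplete(*iterables, max_drain=1000):
    """Like zip_shortest, but afterwards tries to drain the remaining inputs.

    max_drain is a safety cap added for this demo so that it terminates;
    without it, draining an endless input would never return.
    """
    iterators = [iter(it) for it in iterables]
    exhausted = False
    while not exhausted:
        row = []
        for it in iterators:
            try:
                row.append(next(it))
            except StopIteration:
                exhausted = True
                break
        if not exhausted:
            yield tuple(row)
    drained = 0
    for it in iterators:
        for _ in it:
            drained += 1
            if drained >= max_drain:
                raise RuntimeError("input never ends; depletion would hang")


finite = [1, 2, 3]
rows = list(zip_shortest(finite, itertools.cycle("ab")))

try:
    list(zip_then_deplete(finite, itertools.cycle("ab"), max_drain=50))
    would_hang = False
except RuntimeError:
    would_hang = True

print(rows, would_hang)
```

Under this reading, either the zipping logic has to stop touching the other inputs once the shortest one ends, or every input has to be finite and properly closed.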
jacobbieker:
I have heard of xbatcher and will look into it. Our data is a mix of different xarray datasets with different spatial coordinate systems loaded all at once, and we have to compute the temporal and spatial overlaps and then select batches of data based on those intersections, which seems like it might be a bit more complicated than what xbatcher is designed for.
Hmm, xbatcher.BatchGenerator does have an input_overlap parameter, but I’m guessing you’re after something a bit more advanced. There is a possibly related issue open at Handling of overlapping samples and shuffling · Issue #30 · xarray-contrib/xbatcher · GitHub, and you’re welcome to articulate your use-case there or open another issue. Just want to save you from having to reinvent the wheel.
ejguan
(Erjia)
November 28, 2022, 3:55pm
9
Thank you guys for keeping us posted about the bug you found. I will send a patch ASAP and let you know when the new nightly release becomes available for you to test.
ejguan
(Erjia)
December 5, 2022, 6:58pm
10
Now that [4/4][DataPipe] Remove iterator depletion in Zipper by ejguan · Pull Request #89974 · pytorch/pytorch · GitHub has landed, this issue should be fixed. I have run the newly added tests locally.
You should be able to find the nightly releases for both torch and torchdata tomorrow. For how to install nightly releases, please check the official guidance:
pip3 install --pre torch torchdata --extra-index-url https://download.pytorch.org/whl/nightly/cpu
You can replace cpu with cu116, cu117 or rocm5.2 to install the torch build that matches your requirements.