Loading dataset from .pt file using torch.load

Arij-Aladel · May 8, 2022, 2:01pm

Hello !
@ptrblck_de
I have preprocessed a dataset and stored the preprocessed dataset as a .pt file.
Some time ago I deleted the .cache folder entirely.

Now loading the dataset using torch.load("filename.pt")
gives me this error:
/home/arij/.cache/huggingface/datasets/hotpot_qa/distractor/1.0.0/133b9501f892e5193babbad937bee3b4899deb4691ef4d791e6ac0111c875bb5/cache-be3bdbe7ba4c6f40.arrow’. Detail: [errno 2] No such file or directory

I just cannot understand: if I am loading the file from its physical location, why is torch searching the cache? Do I need to preprocess the dataset again, or is there a solution?

Megh_Bhalerao · May 9, 2022, 11:06pm

I have the same question, did you find a solution?

Arij-Aladel · May 10, 2022, 5:49am

No, I could not find a reasonable explanation for this error. It just does not make sense to load from the cache after processing and saving everything in a file. I have preprocessed the data again, but this is not a reasonable solution for me.

JuanFMontesinos · May 10, 2022, 8:53am

Honestly, you should be using numpy, not torch.

I don't know the details, but in the end PyTorch .pt files are pickled objects that can store any sort of info, and whatever dependencies were involved when serializing are required again when loading.
Saving np arrays in an npy file just requires numpy and allows you to use mmap for efficient loading.
If you aim to save more complex structures then you should probably go for HDF5.
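
As a rough illustration of the npy route, here is a minimal sketch. The array contents and the file name are made up for the example; only np.save and np.load with mmap_mode come from numpy itself.

```python
import os
import tempfile

import numpy as np

# Hypothetical preprocessed features; in practice this would be your own array.
features = np.arange(12, dtype=np.float32).reshape(3, 4)

# Hypothetical output location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, features)

# mmap_mode="r" maps the file read-only instead of reading it all into memory.
loaded = np.load(path, mmap_mode="r")
print(loaded.shape, loaded.dtype)
```

The file written this way contains the array data itself, so loading it should not depend on any other file or cache directory.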

Megh_Bhalerao · May 18, 2022, 4:17am

@Arij-Aladel, as @JuanFMontesinos mentioned, while loading you need to import any Python files that might be used by the dataset class (or the data loader class) that was saved. I was able to get mine working after importing some .py files which I was not importing earlier.

Arij-Aladel · May 24, 2022, 3:06pm

I really cannot understand your answers. I did not mention numpy or any other library. I just used torch to save the processed dataset, and I wanted to load it and could not.

InnovArul · May 24, 2022, 5:25pm

I suspect that the error is not related to your code but to the huggingface library. It seems to expect the hotpot_qa dataset to be present in the .cache folder.
Can you try the following code to see if it downloads the dataset?

from datasets import load_dataset
dataset = load_dataset("hotpot_qa", "distractor")

(or)
possibly eliminate or comment out the code that might try to use this dataset?
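
To illustrate why a saved file can still depend on the .cache folder, here is a small sketch using only the standard library. It is an assumption about the mechanism, not the actual huggingface code: the "dataset" here is a plain dict that records where its data lives, and the file names are made up.

```python
import os
import pickle
import tempfile

workdir = tempfile.mkdtemp()

# Plays the role of the .arrow file inside ~/.cache (hypothetical name).
cache_file = os.path.join(workdir, "cache-example.arrow")
with open(cache_file, "wb") as f:
    f.write(b"preprocessed rows")

# The "dataset" object only records where its data lives.
dataset_handle = {"data_file": cache_file}

saved = os.path.join(workdir, "dataset.pkl")
with open(saved, "wb") as f:
    pickle.dump(dataset_handle, f)

os.remove(cache_file)  # simulate deleting the .cache folder

with open(saved, "rb") as f:
    restored = pickle.load(f)  # works: the pickle holds just the path

try:
    with open(restored["data_file"], "rb") as f:
        f.read()
    missing = False
except FileNotFoundError:
    missing = True  # the rows were never stored in the pickle itself

print(missing)
```

In this sketch the failure shows up when the data is read; a real library may reopen its files while unpickling, which would make the error appear inside torch.load instead. If that is what happened here, saving the materialized tensors, or using the datasets library's own save_to_disk, may avoid the dependency on the cache.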

Arij-Aladel · May 24, 2022, 6:28pm

I had already tried that before asking. Anyway, I preprocessed the dataset again because I did not find a solution.