Such as Amazon S3. And if the answer is yes, could you share an example with me?
Thanks for your help.
In TensorFlow, I can use “s3://” paths directly, just like a local file system directory. For example, to train ImageNet with ResNet, I just run “python imagenet_main.py --data_dir=s3://dataset/ilsvrc12/tfrecord”. With boto3, the Python code would still need to be changed.
I don’t know about this feature and how it works.
But I’m sure such a contribution would be welcomed.
It looks pretty scary to me that PyTorch does not have support for SQL and cloud storage. Does this mean that no one is actually using it for large datasets?
The main reason is that it is simple to create such a dataset in PyTorch.
You can simply create a custom Dataset that loads each sample from S3, just as you would in any Python script.
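A minimal sketch of such a custom Dataset, assuming samples are stored as one pickled object per S3 key (the bucket name, key layout, and serialization are all made up for illustration). Any object with `__len__` and `__getitem__` satisfies PyTorch's map-style Dataset protocol, so this also works with a DataLoader; the storage backend is injected as a function so it can be swapped for a local stub when testing:

```python
import io
import pickle


class S3Dataset:
    """Map-style dataset over a list of object keys.

    Implements __len__/__getitem__, which is all a PyTorch
    DataLoader needs from a map-style dataset.
    """

    def __init__(self, keys, fetch_fn):
        # fetch_fn(key) -> bytes; injected so the backend
        # (S3, local disk, a test stub, ...) is pluggable.
        self.keys = keys
        self.fetch_fn = fetch_fn

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        raw = self.fetch_fn(self.keys[idx])
        # Deserialize however your samples are stored;
        # pickle is used here purely for illustration.
        return pickle.loads(raw)


def make_s3_fetcher(bucket):
    """Return a fetch_fn backed by boto3 (bucket name is hypothetical)."""
    import boto3  # imported lazily so the class itself needs no boto3

    s3 = boto3.client("s3")

    def fetch(key):
        buf = io.BytesIO()
        s3.download_fileobj(bucket, key, buf)
        return buf.getvalue()

    return fetch
```

In a real run you would build it as `S3Dataset(keys, make_s3_fetcher("my-bucket"))`; in a unit test you can pass a plain `lambda key: stored_bytes[key]` instead and never touch the network.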
The same goes for SQL: whatever you use to load data with Python, you can use inside a Dataset to fetch your samples.
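As a sketch of the SQL case, here is a Dataset backed by sqlite3 from the standard library; the table name (`samples`) and columns (`id`, `feature`, `label`) are invented for the example, and any other driver (psycopg2, mysqlclient, ...) would follow the same pattern:

```python
import sqlite3


class SQLDataset:
    """Map-style dataset that fetches one row per __getitem__."""

    def __init__(self, db_path):
        # check_same_thread=False allows sharing across threads; with
        # multiprocessing DataLoader workers, open the connection
        # lazily inside __getitem__ instead, one per worker.
        self.conn = sqlite3.connect(db_path, check_same_thread=False)
        # Materialize the primary keys once so indexing is stable.
        self.ids = [r[0] for r in self.conn.execute("SELECT id FROM samples ORDER BY id")]

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        row = self.conn.execute(
            "SELECT feature, label FROM samples WHERE id = ?",
            (self.ids[idx],),
        ).fetchone()
        return row  # (feature, label) tuple
```

One design note: fetching row ids up front keeps `__getitem__` a cheap single-row lookup, which is what a shuffling DataLoader will hammer with random indices.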
Do you have any issues implementing those? You can ask here!
I think it is pretty simple to create an operating system if you are Linus, pretty simple to create Apple if you are Steve, and pretty easy to create PyTorch if you are Soumith. For a solution that is expected to work in a distributed, scalable, resilient, fault-tolerant, real-time manner, things get non-trivial. Like I said, I do not think anyone has done this yet with PyTorch, hence such misconceptions that just using Datasets can solve the problem.
I understand. Hopefully, the builtin Dataset + DataLoader constructs will allow you to do everything you need without having to worry about these questions, and it will “just work”.
If you see unexpected results, you can ask questions here.
I am genuinely sorry, but this is not a psychological counselling session; you do not need to understand me, but rather the design problems that I have written about. Hope has nothing to do with engineering solutions to fundamental design problems. I think your responses prove that no one is actually using PyTorch for solving real-world, large-scale problems, only academic solutions working at megabyte or at best gigabyte scale.
You can look around online, but I don’t think that is really true.
You can check the videos from the developer conference here, for example, or other blog posts about how to use PyTorch in production.
To answer your design problem as I understand it: as soon as you start working with terabytes or petabytes of data for your dataset, you are most likely at a company that has its own infrastructure to support such data. PyTorch cannot provide a fast and reliable way to load data on an infrastructure that we don’t know about.
If you only have small datasets, or a few hundred gigabytes of data, you most likely don’t want to use cloud storage for your training, as it will slow down your training significantly (even a spinning disk will most likely be the bottleneck for most networks).
Did I understand your problem correctly?