(libtorch) How to use torch::data::datasets for custom dataset?

Hi everyone,

Following the example at https://github.com/goldsborough/examples/blob/cpp/cpp/mnist/mnist.cpp, I am trying to write my own training program using libtorch.

However, I can't find any documentation on how to load my own dataset.
It seems that the C++ API is similar to Python API.

First, use torch::data::datasets to create a dataset object.
Second, use torch::data::make_data_loader to create a pointer to a loader.

But I don't know how to define a custom dataset using torch::data::datasets. Can anyone help me?


Hi @Hengd,

Well, actually it is super easy. Just as in this example for the MNIST dataset, you can implement a torch::data::datasets::Dataset<Self, SingleExample>. To do so, you need to override the get(size_t index) method from Dataset. What you need to do is get your data from somewhere and convert it into a Tensor, but how you do that is up to you.

#include <torch/torch.h>

// You can for example just read your data and directly store it as tensor.
torch::Tensor read_data(const std::string& loc)
{
    torch::Tensor tensor = ...

    // Here you need to get your data.

    return tensor;
}

class MyDataset : public torch::data::Dataset<MyDataset>
{
    private:
        torch::Tensor states_, labels_;

    public:
        explicit MyDataset(const std::string& loc_states, const std::string& loc_labels) 
            : states_(read_data(loc_states)),
              labels_(read_data(loc_labels)) {}

        // Note: size() must be overridden as well; see the solution further down the thread.
        torch::data::Example<> get(size_t index) override;
};

torch::data::Example<> MyDataset::get(size_t index)
{
    // You may for example also read in a .csv file that stores locations
    // to your data and then read in the data at this step. Be creative.
    return {states_[index], labels_[index]};
} 

Then, to generate a data loader from it, just do:

// Generate your data set. At this point you can add transforms to your data set, e.g. stack your
// batches into a single tensor.
auto data_set = MyDataset(loc_states, loc_labels).map(torch::data::transforms::Stack<>());

// Generate a data loader.
auto data_loader = torch::data::make_data_loader<torch::data::samplers::SequentialSampler>(
    std::move(data_set), 
    batch_size);

// In a for loop you can now use your data. The loader is held by pointer,
// so it is dereferenced here.
for (auto& batch : *data_loader) {
    auto data = batch.data;
    auto labels = batch.target;
    // do your usual stuff
}

Hopefully this helps, although I don’t know the kind of data you are trying to read in.

Martin


Hello,
Were you able to come up with a working example that mimics that of the PyTorch-based dataset for reading images?

import torch
from os.path import join
from PIL import Image

class GenericDataset(torch.utils.data.Dataset):
  def __init__(self, labels, root_dir, subset=False, transform=None):
    self.labels = labels
    self.root_dir = root_dir
    self.transform = transform

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    img_name = self.labels.iloc[idx, 0]  # file name
    fullname = join(self.root_dir, img_name)
    image = Image.open(fullname).convert('RGB')
    labels = self.labels.iloc[idx, 2]  # category_id
    #         print (labels)
    if self.transform:
      image = self.transform(image)
    return image, labels

Thanks,

Hi @dambo,

yes, the above example mimics the PyTorch version of a dataset.

I will implement an example which clarifies it further and post the link here.

I have now implemented a little classifier with a custom dataset that classifies apples and bananas. You can find it here


Thank you very much.

This helps a lot!
Thank you!

Hi, I found that the example only contains data and target. What can I do when my data contains more components? For example, in a sentence similarity classification dataset, every item contains two sentences and a label, so I would like to define sentence1, sentence2 and label rather than image and labels.
How can I do that? Thanks!
Some Python code follows:

class MyDataset(torch.utils.data.Dataset):
  def __init__(self, text1,text2,labels):
    self.labels = labels
    self.sentence1 = text1
    self.sentence2 = text2
......

How do you encode your sentences? Are they one-hot encoded? Or do you read them in as strings and then encode them somehow?

Yes, they are one-hot encoded.
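
As a rough sketch of the three-field idea (sentence1, sentence2, label), the example type can simply be a struct of your own. The code below is a plain C++ stand-in with no libtorch types, so it only illustrates the shape: `SentencePairExample` and `SentencePairDataset` are hypothetical names, and the one-hot sentences are assumed to be stored as flat float vectors. In libtorch, the struct would presumably be passed as the SingleExample argument of torch::data::datasets::Dataset<Self, SingleExample> and its fields would be tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical example type: three named fields instead of (data, target).
struct SentencePairExample {
    std::vector<float> sentence1;  // assumed one-hot encoding, flattened
    std::vector<float> sentence2;  // assumed one-hot encoding, flattened
    std::int64_t label;
};

// Plain C++ stand-in for a dataset; it does not derive from any libtorch class.
class SentencePairDataset {
public:
    SentencePairDataset(std::vector<std::vector<float>> sentence1,
                        std::vector<std::vector<float>> sentence2,
                        std::vector<std::int64_t> labels)
        : sentence1_(std::move(sentence1)),
          sentence2_(std::move(sentence2)),
          labels_(std::move(labels)) {
        // All three components must describe the same number of items.
        assert(sentence1_.size() == labels_.size());
        assert(sentence2_.size() == labels_.size());
    }

    // Returns one item with all three components.
    SentencePairExample get(std::size_t index) const {
        assert(index < labels_.size());
        return {sentence1_[index], sentence2_[index], labels_[index]};
    }

    // Number of items in the dataset.
    std::size_t size() const { return labels_.size(); }

private:
    std::vector<std::vector<float>> sentence1_;
    std::vector<std::vector<float>> sentence2_;
    std::vector<std::int64_t> labels_;
};
```

One caveat to check against the headers: the Stack<> transform used earlier in the thread is written for Example<>, so a custom struct would likely need its own collation step to turn a batch of items into batched tensors.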

Thank you for your example of how to use libtorch to create custom datasets and loaders. I've followed your example and created a read_data() function that returns a tensor from a CSV file by first creating a vector, then flattening it, and then creating tensors of the right shape using from_blob.
Here is the output of my input and output vectors and tensors (each row is an observation):

Data Vector: 
9 1 4 2 6 
2 7 5 2 3 
4 3 7 8 5 
5 2 4 7 9 
Flat Vector: 
9 1 4 2 6 2 7 5 2 3 4 3 7 8 5 5 2 4 7 9 
Input Tensor: 
 9  1  4  2  6
 2  7  5  2  3
 4  3  7  8  5
 5  2  4  7  9
[ CPUDoubleType{4,5} ]
Data Vector: 
22 
19 
27 
27 
Flat Vector: 
22 19 27 27 
Output Tensor (target): 
 22
 19
 27
 27
[ CPUDoubleType{4,1} ]
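
For readers who want to reproduce the vector and flat-vector steps shown above, here is a minimal sketch using only the standard library. The helper names `parse_csv` and `flatten` are hypothetical, and the input is assumed to be comma-separated numeric text with no header row. The flat buffer is row-major, which matches the {rows, cols} shape passed to from_blob afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse comma-separated text into rows of doubles.
std::vector<std::vector<double>> parse_csv(const std::string& text) {
    std::vector<std::vector<double>> rows;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.empty()) continue;  // skip blank lines
        std::vector<double> row;
        std::istringstream cells(line);
        std::string cell;
        while (std::getline(cells, cell, ',')) {
            row.push_back(std::stod(cell));
        }
        rows.push_back(row);
    }
    return rows;
}

// Hypothetical helper: flatten the rows into one contiguous row-major buffer.
std::vector<double> flatten(const std::vector<std::vector<double>>& rows) {
    std::vector<double> flat;
    if (rows.empty()) return flat;
    const std::size_t cols = rows.front().size();
    flat.reserve(rows.size() * cols);
    for (const auto& row : rows) {
        assert(row.size() == cols);  // every observation needs the same width
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}
```

Keep in mind that from_blob does not take ownership of the memory it is given, so the flat vector has to outlive the tensor, or the tensor should be cloned before the vector goes out of scope.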

Unfortunately if I use the class MyDataset in the main function I get the error:

a cast to abstract class “MyDataset” is not allowed: – pure virtual function “torch::data::datasets::BatchDataset<Self, Batch, BatchRequest>::size [with Self=MyDataset, Batch=std::vector<torch::data::Example<at::Tensor, at::Tensor>, std::allocator<torch::data::Example<at::Tensor, at::Tensor>>>, BatchRequest=c10::ArrayRef<size_t>]” has no overrider C/C++(389)

I'm using the class like this:
auto data_set = MyDataset(input_loc, output_loc);
Can someone please help me out?

EDIT / SOLUTION:
Ok I solved it by also overriding the size() method like this:

torch::optional<size_t> size() const override {
    return labels_.size(0);
}

Something else I had to do to resolve all compilation errors was to dereference the data_loader pointer in the range-based for loop, like this:

for (auto& batch : *data_loader) { ... }

Otherwise it would not compile.
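
The reason is that make_data_loader returns a std::unique_ptr to the loader, and a range-based for loop needs the object itself, not the smart pointer. The sketch below shows the same situation with standard library types only; `FakeLoader`, `make_fake_loader` and `sum_batches` are hypothetical stand-ins, not libtorch names.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Stand-in for a data loader: any type that provides begin() and end().
using FakeLoader = std::vector<int>;

// Mirrors the pattern of a factory function that returns a unique_ptr.
std::unique_ptr<FakeLoader> make_fake_loader() {
    return std::make_unique<FakeLoader>(FakeLoader{1, 2, 3});
}

// Sums all "batches" produced by the stand-in loader.
int sum_batches(const std::unique_ptr<FakeLoader>& loader) {
    assert(loader != nullptr);
    int total = 0;
    // `for (int batch : loader)` would not compile: std::unique_ptr has no
    // begin()/end(). Dereferencing yields the iterable object itself.
    for (int batch : *loader) {
        total += batch;
    }
    return total;
}
```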
I really miss having more tutorials for the C++ API in the documentation. How can I contribute tutorials for the C++ API so that beginners like me don't run into these issues?

@shyney7
Take a look at this recent dataloader/dataset tutorial PR. You can contribute like this as well: Add custom dataset and dataloader tutorial for C++ by dhpollack · Pull Request #841 · pytorch/tutorials · GitHub

git clone --recursive https://github.com/pytorch/pytorch

These files give you a complete example:
~/pytorch/torch/csrc/api/include/torch/data/example.h
~/pytorch/torch/csrc/api/include/torch/data/datasets/mnist.h
~/pytorch/torch/csrc/api/src/data/datasets/mnist.cpp
~/pytorch/test/cpp/api/integration.cpp

full example