(libtorch) How to use torch::data::datasets for custom dataset?

Hengd · January 9, 2019, 6:58am

Hi everyone,

Following the example at https://github.com/goldsborough/examples/blob/cpp/cpp/mnist/mnist.cpp, I am trying to write my own training program using libtorch.

However, I can't find any documentation on how to load my own dataset.
It seems that the C++ API is similar to the Python API.

First, use torch::data::datasets to create a dataset object.
Second, use torch::data::make_data_loader to create a pointer to a data loader.

But I don't know how to define a custom dataset using torch::data::datasets. Can anyone help me?

mhubii · March 15, 2019, 6:33pm

Hi @Hengd,

Well, actually it is super easy. Just as in this example for the MNIST dataset, you can implement a torch::data::datasets::Dataset<Self, SingleExample>. To do so, you need to override the get(size_t index) method from Dataset. What you need to do is get your data from somewhere and convert it into a Tensor, but how you do that is up to you.

#include <string>
#include <torch/torch.h>

// You can for example just read your data and directly store it as tensor.
torch::Tensor read_data(const std::string& loc)
{
    // Here you need to get your data from loc and convert it into a tensor.
    torch::Tensor tensor = ...

    return tensor;
}

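class MyDataset : public torch::data::Dataset<MyDataset>
{
    private:
        torch::Tensor states_, labels_;

    public:
        explicit MyDataset(const std::string& loc_states, const std::string& loc_labels)
            : states_(read_data(loc_states)),
              labels_(read_data(loc_labels)) {}

        // Returns the (data, target) example stored at the given index.
        torch::data::Example<> get(size_t index) override;

        // size() must be overridden as well; without it MyDataset stays abstract
        // (see the error reported later in this thread).
        torch::optional<size_t> size() const override { return labels_.size(0); }
};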

torch::data::Example<> MyDataset::get(size_t index)
{
    // You may for example also read in a .csv file that stores locations
    // to your data and then read in the data at this step. Be creative.
    return {states_[index], labels_[index]};
}

Then, to generate a data loader from it, just do:

// Generate your data set. At this point you can add transforms to your data set, e.g. stack your
// batches into a single tensor.
auto data_set = MyDataset(loc_states, loc_labels).map(torch::data::transforms::Stack<>());

// Generate a data loader.
auto data_loader = torch::data::make_data_loader<torch::data::samplers::SequentialSampler>(
    std::move(data_set),
    batch_size);

// In a for loop you can now use your data. make_data_loader returns a pointer
// to the loader, so it has to be dereferenced here.
for (auto& batch : *data_loader) {
    auto data = batch.data;
    auto labels = batch.target;
    // do your usual stuff
}
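
The comment in get() about reading a .csv file that stores the locations of your data could look roughly like the sketch below. It is plain C++ without libtorch, and it rests on assumptions that are not part of the example above: the names IndexEntry and parse_index are made up, and the file is assumed to hold one "path,label" pair per line with an integer label.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: one line of the index file, "path,label".
struct IndexEntry {
    std::string path;
    long label;
};

// Parses lines of the assumed form "path,label" from any input stream.
// Empty lines and lines without a comma are skipped; std::stol throws
// if the label field is not an integer (for example a header row).
std::vector<IndexEntry> parse_index(std::istream& in) {
    std::vector<IndexEntry> entries;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t comma = line.rfind(',');
        if (line.empty() || comma == std::string::npos) continue;
        IndexEntry entry;
        entry.path = line.substr(0, comma);
        entry.label = std::stol(line.substr(comma + 1));
        entries.push_back(entry);
    }
    return entries;
}
```

A dataset could call parse_index once in its constructor (passing a std::ifstream opened on the index file), return the number of entries from size(), and load the file at entries[index].path inside get().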

Hopefully this helps, although I don’t know the kind of data you are trying to read in.

Martin

dambo · June 15, 2019, 3:57pm

Hello,
Were you able to come up with a working example that mimics a PyTorch-based dataset for reading images, like this one?

from os.path import join

import torch
from PIL import Image


class GenericDataset(torch.utils.data.Dataset):
  def __init__(self, labels, root_dir, subset=False, transform=None):
    self.labels = labels  # pandas DataFrame: file name in column 0, category_id in column 2
    self.root_dir = root_dir
    self.transform = transform

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    img_name = self.labels.iloc[idx, 0]  # file name
    fullname = join(self.root_dir, img_name)
    image = Image.open(fullname).convert('RGB')
    labels = self.labels.iloc[idx, 2]  # category_id
    if self.transform:
      image = self.transform(image)
    return image, labels

Thanks,

mhubii · June 23, 2019, 4:03pm

Hi @dambo,

yes, the above example mimics the PyTorch version of a dataset.

I will implement an example which clarifies it further and post the link here.

mhubii · June 24, 2019, 9:47am

I have now implemented a little classifier with a custom dataset that classifies apples and bananas. You can find it here

dambo · June 27, 2019, 1:09pm

Thank you very much.

jianjianGJ · October 22, 2019, 2:08pm

This helps a lot!!
Thank you!!

abtion · November 13, 2019, 9:29am

Hi, I found that the example only contains data and target. What can I do when my data contains more components? For example, in a sentence similarity classification dataset, every item contains two sentences and a label. For this dataset, I would like to define sentence1, sentence2 and label rather than image and labels.
How can I do that? Thanks!
Some Python code follows:

class MyDataset(torch.utils.data.Dataset):
  def __init__(self, text1,text2,labels):
    self.labels = labels
    self.sentence1 = text1
    self.sentence2 = text2
......

mhubii · November 13, 2019, 12:04pm

How do you encode your sentences? Are they one-hot encoded? Or do you read them in as strings and then encode them somehow?

abtion · November 14, 2019, 2:36am

Yes, they are one-hot encoded.
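
One possible way to fit such a dataset into the (data, target) layout, offered only as a sketch: concatenate the two encoded sentences into a single data vector and keep the label as the target. The code below is plain C++ without libtorch, and the names PairExample and make_pair_example are hypothetical; with libtorch the same idea would be applied to tensors instead of std::vector. It assumes both sentences are already encoded as flat float vectors.

```cpp
#include <vector>

// Hypothetical example type: both sentences packed into one data vector.
struct PairExample {
    std::vector<float> data;  // sentence1 encoding followed by sentence2 encoding
    long target;              // class label
};

// Concatenates the two encoded sentences and attaches the label.
PairExample make_pair_example(const std::vector<float>& sentence1,
                              const std::vector<float>& sentence2,
                              long label) {
    PairExample example;
    example.data.reserve(sentence1.size() + sentence2.size());
    example.data.insert(example.data.end(), sentence1.begin(), sentence1.end());
    example.data.insert(example.data.end(), sentence2.begin(), sentence2.end());
    example.target = label;
    return example;
}
```

The model then has to split the data at sentence1.size(), so this works most simply when every sentence is encoded to the same fixed length. The SingleExample template parameter mentioned earlier in the thread suggests that a custom example type is another route, which is not shown here.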

shyney7 · March 10, 2021, 9:36am

Thank you for your example of how to use libtorch to create custom datasets and loaders. I've followed your example and created a read_data() function that returns a tensor from a CSV file by first building a vector, then flattening it, and then creating tensors of the right shape using from_blob.
Here is the output of my input and output vectors and tensors (each row is an observation):

Data Vector: 
9 1 4 2 6 
2 7 5 2 3 
4 3 7 8 5 
5 2 4 7 9 
Flat Vector: 
9 1 4 2 6 2 7 5 2 3 4 3 7 8 5 5 2 4 7 9 
Input Tensor: 
 9  1  4  2  6
 2  7  5  2  3
 4  3  7  8  5
 5  2  4  7  9
[ CPUDoubleType{4,5} ]
Data Vector: 
22 
19 
27 
27 
Flat Vector: 
22 19 27 27 
Output Tensor (target): 
 22
 19
 27
 27
[ CPUDoubleType{4,1} ]
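
The flattening step behind the output above can be sketched as follows. This is not the code used for that output, only a minimal plain C++ illustration; the name flatten_row_major is made up, and the rows are assumed to be doubles of equal length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flattens a rectangular 2-D vector into one contiguous row-major buffer,
// which is the layout expected when wrapping the buffer as a {rows, cols} tensor.
std::vector<double> flatten_row_major(const std::vector<std::vector<double>>& rows) {
    std::vector<double> flat;
    if (rows.empty()) return flat;
    const std::size_t cols = rows.front().size();
    flat.reserve(rows.size() * cols);
    for (const auto& row : rows) {
        assert(row.size() == cols);  // every observation needs the same width
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}
```

One thing to watch for when passing such a buffer to torch::from_blob: the resulting tensor does not copy or own the memory, so calling .clone() on it is a common way to keep the data valid after the vector goes out of scope.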

Unfortunately, if I use the class MyDataset in the main function, I get the error:

a cast to abstract class “MyDataset” is not allowed: – pure virtual function “torch::data::datasets::BatchDataset<Self, Batch, BatchRequest>::size [with Self=MyDataset, Batch=std::vector<torch::data::Example<at::Tensor, at::Tensor>, std::allocator<torch::data::Example<at::Tensor, at::Tensor>>>, BatchRequest=c10::ArrayRef<size_t>]” has no overrider C/C++(389)

I'm using the class like this:
auto data_set = MyDataset(input_loc, output_loc);
Can someone please help me out?

EDIT / SOLUTION:
OK, I solved it by also overriding the size() method like this:

torch::optional<size_t> size() const override {
    return labels_.size(0);
}

Something else I had to do to resolve all compilation errors was to dereference the data_loader pointer in the range-based for loop, like this:

for (auto& batch: *data_loader) { ... };

Otherwise it would not compile.
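
The dereference is needed because make_data_loader returns a std::unique_ptr to the loader, and a range-based for loop needs the object itself (something with begin() and end()), not a pointer to it. The sketch below shows the same pattern in plain C++ with a std::vector standing in for the data loader; make_fake_loader and sum_batches are made-up names for this illustration only.

```cpp
#include <memory>
#include <vector>

// Stand-in for make_data_loader: returns a unique_ptr to an iterable object.
std::unique_ptr<std::vector<int>> make_fake_loader() {
    return std::unique_ptr<std::vector<int>>(new std::vector<int>{1, 2, 3});
}

// Iterates over the pointed-to object and adds up its elements.
int sum_batches() {
    auto loader = make_fake_loader();
    int sum = 0;
    // for (int batch : loader) {}  // would not compile: unique_ptr has no begin()/end()
    for (int batch : *loader) {     // dereference to iterate the pointed-to object
        sum += batch;
    }
    return sum;
}
```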
I really miss having more tutorials for the C++ API in the documentation. How can I contribute tutorials for the C++ API so that beginners like me don't run into these issues?

glaringlee · March 10, 2021, 4:56pm

@shyney7
Take a look at this recent dataloader/dataset tutorial PR; you can contribute in the same way: "Add custom dataset and dataloader tutorial for C++" by dhpollack (Pull Request #841, pytorch/tutorials, GitHub)

Jesse_Stone · September 1, 2022, 2:14am

git clone --recursive https://github.com/pytorch/pytorch

These files give you a complete example:
~/pytorch/torch/csrc/api/include/torch/data/example.h
~/pytorch/torch/csrc/api/include/torch/data/datasets/mnist.h
~/pytorch/torch/csrc/api/src/data/datasets/mnist.cpp
~/pytorch/test/cpp/api/integration.cpp

full example