Hi Guys,
I have data sets of about 100,000 training examples. Every example has roughly 100 features (normalized between -2 and 2) and 2 labels (1,0 = good; 0,1 = bad), so the data looks something like this:
112 inputs | 2 outputs
-0.58572,-0.0248404,…,-0.926154,1.14224,-0.731815 | 0,1
-1.05784,-0.71582,…,0.989832,1.99402,-1.60195 | 0,1
1.92291,1.79471,…,-0.339433,0.572841,-0.306174 | 1,0
…
These data sets are generated by C++ code and then passed to a custom handcrafted neural net. I was asked to rewrite the old NN part using libtorch. I used the following NN definition (which more or less follows the old NN's structure):
class TorchNet : public torch::nn::Module {
    torch::nn::Linear input{ nullptr };
    torch::nn::Linear hidden1{ nullptr };
    torch::nn::Linear hidden2{ nullptr };
    torch::nn::Linear hidden3{ nullptr };
    torch::nn::Linear hidden4{ nullptr };
    torch::nn::Linear hidden5{ nullptr };
    torch::nn::Linear output{ nullptr };
    int input_size = 0;
public:
    TorchNet(int inputs_count);
    torch::Tensor forward(torch::Tensor x);
    // dataVec - vector of pairs of (input values, targets)
    void train_step(std::vector<std::pair<torch::Tensor, torch::Tensor>> dataVec, torch::optim::Optimizer& optimizer);
    // ...
};
TorchNet::TorchNet(int inputs_count)
{
    input_size = inputs_count;
    input = register_module("input", torch::nn::Linear(inputs_count, 30));
    hidden1 = register_module("hidden1", torch::nn::Linear(30, 20));
    hidden2 = register_module("hidden2", torch::nn::Linear(20, 15));
    hidden3 = register_module("hidden3", torch::nn::Linear(15, 10));
    hidden4 = register_module("hidden4", torch::nn::Linear(10, 5));
    hidden5 = register_module("hidden5", torch::nn::Linear(5, 3));
    output = register_module("output", torch::nn::Linear(3, 2));
}
torch::Tensor TorchNet::forward(torch::Tensor x) {
    x = torch::tanh(input->forward(x));
    x = torch::dropout(x, /*p=*/0.2, /*train=*/is_training());
    x = torch::tanh(hidden1->forward(x));
    x = torch::tanh(hidden2->forward(x));
    x = torch::tanh(hidden3->forward(x));
    x = torch::tanh(hidden4->forward(x));
    x = torch::tanh(hidden5->forward(x));
    x = output->forward(x);
    return torch::log_softmax(x, /*dim=*/0); // x is a single 1-D sample here
}
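Jumping ahead to my last question below: I tried to sketch what a batched version of forward() might look like. From what I understand, torch::nn::Linear already accepts a leading batch dimension, so I believe only the log_softmax dimension would need to change. This is untested, and forward_batched is just my own name for it:

// Hypothetical batched forward: x has shape [N, 112] instead of [112].
torch::Tensor TorchNet::forward_batched(torch::Tensor x) {
    x = torch::tanh(input->forward(x));
    x = torch::dropout(x, /*p=*/0.2, /*train=*/is_training());
    x = torch::tanh(hidden1->forward(x));
    x = torch::tanh(hidden2->forward(x));
    x = torch::tanh(hidden3->forward(x));
    x = torch::tanh(hidden4->forward(x));
    x = torch::tanh(hidden5->forward(x));
    x = output->forward(x);
    return torch::log_softmax(x, /*dim=*/1); // dim 1 is now the class dimension
}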
// dataVec - vector of pairs of (input values, targets)
void TorchNet::train_step(std::vector<std::pair<torch::Tensor, torch::Tensor>> dataVec, torch::optim::Optimizer& optimizer)
{
    train();
    double count = 0;
    double max = dataVec.size();
    double pct = 0;
    for (auto& sample : dataVec) {
        auto data = sample.first, targets = sample.second.to(at::kLong);
        optimizer.zero_grad();
        auto output = forward(data);
        // data.sizes() = [112], output.sizes() = [2], targets.sizes() = [2]
        auto loss = torch::nll_loss(output, targets);
        AT_ASSERT(!std::isnan(loss.item<float>()));
        loss.backward();
        optimizer.step();
        // print the loss roughly every 20% of the data set
        if (count / max >= pct)
        {
            std::cout << "after " << (100.0 * count / max) << "% loss = " << loss.item<double>() << std::endl;
            pct += 0.2;
        }
        ++count;
    }
}
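Along the same lines, here is my untested guess at a batched train_step. It assumes inputs is an [N, 112] float tensor, targets is an [N] long tensor of class indices (0 = good, 1 = bad), and forward_batched is the sketch above; train_step_batched is again my own name:

// Untested sketch of a batched training step.
void TorchNet::train_step_batched(torch::Tensor inputs, torch::Tensor targets,
                                  torch::optim::Optimizer& optimizer,
                                  int64_t batch_size)
{
    train();
    int64_t n = inputs.size(0);
    for (int64_t i = 0; i < n; i += batch_size) {
        int64_t end = std::min(i + batch_size, n);
        auto batch = inputs.slice(0, i, end);          // [B, 112]
        auto batch_targets = targets.slice(0, i, end); // [B]
        optimizer.zero_grad();
        auto out = forward_batched(batch);             // [B, 2] log-probabilities
        auto loss = torch::nll_loss(out, batch_targets);
        loss.backward();
        optimizer.step();
    }
}

Is this roughly how batching is supposed to work in libtorch?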
To prepare the training data I do the following:
// data - vector of pairs - first: vector of 112 input values, second: vector of 2 labels
double TorchNetTest::train(std::vector<std::pair<std::vector<double>, std::vector<double>>> data, ...)
{
    ...
    auto input_size = data[0].first.size();
    model = std::make_unique<TorchNet>(input_size);
    std::vector<std::pair<torch::Tensor, torch::Tensor>> train_vec;
    for (const auto& data_label : data)
    {
        const auto& inputs = data_label.first;  // a vector of 112 doubles
        const auto& labels = data_label.second; // a vector of 2 doubles
        auto sample = std::pair<torch::Tensor, torch::Tensor>();
        sample.first = torch::tensor(c10::ArrayRef(inputs));
        sample.second = torch::tensor(c10::ArrayRef(labels));
        train_vec.push_back(sample);
    }
    torch::optim::SGD optimizer(model->parameters(),
        torch::optim::SGDOptions(0.01).momentum(0.5));
    ...
    model->train_step(train_vec, optimizer);
    ...
}
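Instead of creating ~100,000 tiny tensors, I was also wondering whether I should replace the conversion loop above with two big tensors built up front. This is an untested sketch; converting the one-hot (1,0)/(0,1) labels into class indices is my own guess at what a batched nll_loss expects:

// Untested sketch: one [N, 112] input tensor, one [N] target tensor.
std::vector<float> flat_inputs;
std::vector<int64_t> flat_targets;
flat_inputs.reserve(data.size() * input_size);
flat_targets.reserve(data.size());
for (const auto& data_label : data)
{
    for (double v : data_label.first)
        flat_inputs.push_back(static_cast<float>(v));
    // (1,0) -> class 0 (good), (0,1) -> class 1 (bad)
    flat_targets.push_back(data_label.second[1] > data_label.second[0] ? 1 : 0);
}
auto inputs = torch::tensor(flat_inputs)
                  .reshape({ (int64_t)data.size(), (int64_t)input_size });
auto targets = torch::tensor(flat_targets);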
It works (the model trains to proper accuracy), but it is very slow compared to the old NN. I am using the CPU version of libtorch (the old NN also runs on CPU only), and I know it could be much faster on a GPU, but I would like to make it at least usable on CPU. As I'm a newbie, I'm sure my implementation is far from optimal.
Do you have any suggestions on how to make it more performant?
For example:
Can I rearrange the data to make the training step faster?
Is it possible to perform training with multiple threads? (The thread settings shown after this list are all I have found so far.)
Is it possible to modify forward() / nll_loss() so they accept multiple training examples instead of just one? (Are the batched sketches above on the right track?)
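Regarding the threads question: the only knobs I have found so far are the ATen thread settings below, but I don't know whether they are the right ones for this case:

#include <ATen/Parallel.h>

// Controls intra-op parallelism (how many threads a single op may use).
at::set_num_threads(4);
std::cout << "threads: " << at::get_num_threads() << std::endl;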
Thank you in advance.