I’m not sure if there is a performance comparison.
We also run CI tests on Windows machines, so it should be covered. @peterjc123 might chime in on this topic as the expert.
I ran a C++ inference benchmark on Windows to compare the performance of PyTorch and CNTK. Unfortunately, CNTK seems to be faster! I simply create a tensor and pass it through a traced network. I tested with resnet50, resnet18, and vgg16. Loading the model takes MUCH longer with PyTorch, but that is OK. The forward pass is also slower with PyTorch!
My GPU is a GTX 1050.
This is the code:
#include <torch/script.h>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

int num = 1000;
std::string smodel = "address/resnet18_cuda_trace.pt";
// Load the traced module from the path above (the original snippet
// referenced an unrelated net_path_ variable here).
torch::jit::script::Module module = torch::jit::load(smodel);
auto start = std::chrono::high_resolution_clock::now();
at::Tensor out;  // declared outside the loop so it is still in scope below
for (int times = 0; times < num; times++) {
    auto ten = torch::ones({ 1, 3, 224, 224 }, torch::kCUDA);
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(ten);
    // forward() takes the vector of IValues directly; copying the result
    // to the CPU also synchronizes with the GPU, so the timing is honest.
    out = module.forward(inputs).toTensor().to(torch::kCPU);
}
auto finish = std::chrono::high_resolution_clock::now();
auto msSinceStart = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start).count();
std::cout << msSinceStart << " " << out.sizes() << std::endl;
Is there anything I can do to make this simple code faster?
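For reference, the usual first tweaks for LibTorch inference timings are disabling autograd and warming up before the timed loop. A minimal sketch against the same module as above (nothing here is specific to one model):

// Disable autograd bookkeeping; pure inference doesn't need gradients.
torch::NoGradGuard no_grad;
module.eval();

// Allocate the input once instead of re-creating it every iteration.
auto ten = torch::ones({ 1, 3, 224, 224 }, torch::kCUDA);

// Warm up: the first forward passes pay one-time CUDA/cuDNN setup costs,
// so exclude them from the timed section.
for (int i = 0; i < 10; i++) {
    module.forward({ ten });
}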
If you’re only doing inference, you could also consider exporting your model via ONNX and doing inference with ONNX Runtime. If you’re tightly integrated with Windows, you might benefit from the WinRT APIs distributed as part of the OS.
You’ll probably want to run benchmarks for your own use case, but I’ve found ORT to generally have excellent performance even without tuning. PyTorch for training + development and ORT for deployment is a pretty good combo.
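To make that concrete, here is a minimal sketch of CPU inference with the ONNX Runtime C++ API, matching the 1x3x224x224 input from the benchmark above. The file name resnet18.onnx and the tensor names "input"/"output" are assumptions from a typical torch.onnx.export call, so check your exported graph for the real names; for GPU inference you would additionally append the CUDA execution provider to the session options.

#include <onnxruntime_cxx_api.h>
#include <array>
#include <iostream>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "bench");
    Ort::SessionOptions opts;
    // The model path is a wide string on Windows; "resnet18.onnx" is an
    // assumed name -- point this at your exported file.
    Ort::Session session(env, L"resnet18.onnx", opts);

    // Same dummy input as the LibTorch benchmark: 1x3x224x224, all ones.
    std::array<int64_t, 4> shape{ 1, 3, 224, 224 };
    std::vector<float> data(1 * 3 * 224 * 224, 1.0f);
    auto mem = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input = Ort::Value::CreateTensor<float>(
        mem, data.data(), data.size(), shape.data(), shape.size());

    // "input"/"output" are assumed names from the export; adjust as needed.
    const char* in_names[] = { "input" };
    const char* out_names[] = { "output" };
    auto outputs = session.Run(Ort::RunOptions{ nullptr }, in_names, &input, 1,
                               out_names, 1);
    std::cout << outputs.size() << " output tensor(s)" << std::endl;
    return 0;
}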