OpenCL Inference Engine

I just wanted to share something I’ve been working on. It’s in the very early stages, but I was hoping to get some assistance in building it up - I think it could be really useful.

I realize that a full OpenCL port would be a huge amount of work (and I would like to be able to run models on OpenCL devices), so I decided to take matters into my own hands. This project lets you train a network in pytorch, save each of the tensors for weights/filters/biases (individually for now, though I’d love to change that), and then load them into ArrayFire. Using ArrayFire as the OpenCL library, the forward pass then runs as usual. The easiest path I’ve found from python to C++ is through numpy and its API. I have to do some index gymnastics because of ArrayFire’s conventions, but so far I can initialize tensors for Conv2d and Linear layers in python, save them, load them into C++, and perform an inference. It’s more cumbersome than staying in pytorch and not ideal for training and development, but it’s meant as a tool for deployment.
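As a rough sketch of the export step (filenames and shapes here are just illustrative, and the C++ side would need some way to read the NumPy format, e.g. a library like cnpy, before constructing af::array objects):

```python
import numpy as np

# PyTorch stores Conv2d weights as (out_channels, in_channels, kH, kW)
# in row-major order; ArrayFire arrays are column-major, which is where
# the "index gymnastics" come in on the C++ side.
conv_weight = np.arange(2 * 3 * 3 * 3, dtype=np.float32).reshape(2, 3, 3, 3)
linear_weight = np.zeros((4, 8), dtype=np.float32)

# save one .npy file per tensor (hypothetical filenames)
np.save('conv1_weight.npy', conv_weight)
np.save('fc1_weight.npy', linear_weight)

# round-trip check: the saved file preserves shape and dtype
loaded = np.load('conv1_weight.npy')
assert loaded.shape == (2, 3, 3, 3) and loaded.dtype == np.float32
```

In the real project the tensors would come from a trained model (e.g. iterating over `named_parameters()` and calling `.data.numpy()` on each), but the save/load mechanics are the same.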

I would love to have help from people who know ArrayFire/pytorch better than I do - the next layers to do are pooling layers (maxpool, avgpool) and batchnorm (especially batchnorm). Help/suggestions for optimizing ArrayFire code (I’m a total newb with AF) would be awesome as well.

Project link: pytorch-inference
Please be gentle, I’ve only been working on it for about a day at this point so it’s still pretty rough.


Great. You might also be interested to see:


I hadn’t seen the second one - thanks! I saw the first one but only briefly, I’ll take another look.

Correct me if I’m wrong on the second one (pytorch2c) - you have to compile the graph for each input it seems?

Not for each input. It uses the trace of the graph generated via a single input (because pytorch uses tape-based autodiff).
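A toy illustration of the tape idea (this is a sketch of the general concept only, not pytorch2c’s actual mechanism): running the function once on a sample input records the ops, and the recorded tape then replays for any other input.

```python
# Minimal sketch of tape-based tracing: ops are recorded once,
# then the recorded tape replays for new inputs.
class Tape:
    def __init__(self):
        self.ops = []

    def record(self, name, fn):
        # remember the op; a real tape would also track tensor metadata
        self.ops.append((name, fn))

    def replay(self, x):
        # run the recorded ops in order on a fresh input
        for _, fn in self.ops:
            x = fn(x)
        return x

tape = Tape()
tape.record('mul2', lambda v: v * 2)
tape.record('add1', lambda v: v + 1)

assert tape.replay(3) == 7    # the same tape works for any input
assert tape.replay(10) == 21  # no re-tracing needed
```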


Just a quick update - I’ve managed to implement many of the layers that are in mvitz/thnets, and the documentation (while admittedly incomplete) has its own doxygen site and everything :smiley:

I’m having issues with unpooling efficiently, and for some reason softmax is slow. If someone who knows ArrayFire well has advice, I’d welcome it!
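For reference, the numerically stable softmax formulation, sketched here in NumPy (an ArrayFire version would map onto af::max, af::exp, and af::sum; this shows only the math, not where the slowdown is):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the per-axis max before exponentiating; this avoids
    # overflow without changing the result
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

probs = softmax(np.array([1.0, 2.0, 3.0]))
assert np.isclose(probs.sum(), 1.0)       # a valid distribution
assert probs[2] > probs[1] > probs[0]     # order is preserved
```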

The pytorch2c repo is fascinating, but it doesn’t look like it has any acceleration at all (unless it supports CUDA, and even then I need OpenCL for embedded devices). I’m looking into the transfer techniques that both projects use to see if there’s an approach that’s more portable and cleaner than what I’m using now.

In order for pytorch2c to work cleanly we will need forward traces to be implemented on top of the new autograd. That’s what I tasked myself to do now in PyTorch, although I had to slow down due to other commitments. I’ll start working actively on it once this crazy week is over.

Also, ideally I’d like to change the way pytorch2c works: first generate an intermediate representation, then compilable code. This way we’ll be able to easily add new backends (cuda, opencl, …) in addition to THNN.
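A toy sketch of that two-stage shape (all names and backends here are purely hypothetical illustrations, not pytorch2c’s actual design): the graph is lowered to a backend-agnostic op list, and each backend only needs a small emitter over that IR.

```python
# stage 1: a backend-agnostic intermediate representation of the graph
ir = [
    ('conv2d', {'src': 'x', 'dst': 't0', 'w': 'w0'}),
    ('relu',   {'src': 't0', 'dst': 'y'}),
]

# stage 2: each backend is just a table of code templates over the same IR
TEMPLATES = {
    'thnn':   {'conv2d': 'THNN_Conv2d({src}, {dst}, {w});',
               'relu':   'THNN_ReLU({src}, {dst});'},
    'opencl': {'conv2d': 'cl_conv2d({src}, {dst}, {w});',
               'relu':   'cl_relu({src}, {dst});'},
}

def emit(ir, backend):
    # adding a new backend means adding templates, not reworking the frontend
    return '\n'.join(TEMPLATES[backend][op].format(**args) for op, args in ir)

print(emit(ir, 'opencl'))
```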