Just a quick update - I’ve managed to implement many of the layers that are in mvitz/thnets, and the documentation (while admittedly incomplete) has its own Doxygen site and everything.
I’m having trouble implementing unpooling efficiently, and for some reason softmax is slow. If anyone who knows ArrayFire well has advice, I’d welcome it!
The pytorch2c repo is fascinating, but it doesn’t look like it has any acceleration at all (unless it supports CUDA, and even then I need OpenCL for embedded devices). I’m looking into the transfer techniques that both projects use to see if there’s an approach that’s more portable and cleaner than what I’m using now.