Implementing an OpenCL backend for PyTorch

I started developing a library that implements common DL operations in OpenCL.
It is somewhat similar to cuDNN/MIOpen, but it also provides a library for inference and basic training.

The project is in very early stages but it is already:

I’m looking for a way to create a custom backend for PyTorch. It is clear to me that this is a lot of work, but that is the technical part. The big, critical, and complex part, writing good high-performance kernels that are worth something, is already in very good shape.

I looked into the “out-of-source” backend approach, but the documentation is lacking. Is there a template, or a minimal useful backend, that I could “rewrite” for OpenCL?

Any tutorials or pointers would be appreciated.