There has been a lot of work on different data representations such as fixed-point and floating-point with various bit widths for efficient training and inference, especially in deep neural networks. I have implemented customizable data types and arithmetic operations in Python where the user can choose a data type, customize the number of bits allocated to each field, and define how arithmetic operations are performed (e.g., truncate a specified number of bits before computation). My implementation can be integrated with NumPy by creating arrays where the dtype is object instead of NumPy’s default data types.
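A minimal sketch of the kind of thing I mean (simplified and hypothetical, not my actual classes):

```python
import numpy as np

class Fixed:
    """Toy fixed-point number with `frac_bits` fractional bits, stored as an int.
    (Hypothetical illustration only, not the real implementation.)"""
    __slots__ = ("raw", "frac_bits")

    def __init__(self, value, frac_bits=8):
        self.frac_bits = frac_bits
        self.raw = int(round(value * (1 << frac_bits)))

    def __add__(self, other):
        out = Fixed(0, self.frac_bits)
        out.raw = self.raw + other.raw
        return out

    def __mul__(self, other):
        out = Fixed(0, self.frac_bits)
        # truncate the extra fractional bits produced by the multiply
        out.raw = (self.raw * other.raw) >> self.frac_bits
        return out

    def __float__(self):
        return self.raw / (1 << self.frac_bits)

    def __repr__(self):
        return f"Fixed({float(self):.4f})"

# NumPy stores these in an object array and dispatches element-wise
# to the dunder methods:
a = np.array([Fixed(1.5), Fixed(2.25)], dtype=object)
b = np.array([Fixed(0.5), Fixed(0.75)], dtype=object)
print(a + b)  # element-wise __add__
print(a * b)  # element-wise __mul__
```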

I have recently started using PyTorch and really liked it compared to other machine learning libraries. I would like to integrate my code into PyTorch, but I’m not sure where to begin. I have a few questions that will hopefully help me throughout this process.

What are the most relevant parts of the source code I should start reading?

Can I use my current Python implementation and run it on GPUs, or should I re-implement it in C/C++/CUDA?

Interesting idea. I think most library functions are only available in 64-bit and 32-bit (and maybe 16-bit from CUDA). So it is possible to trim the end results to a certain number of bits, but restricting the intermediate computation results to a certain number of bits is probably very difficult. I’m wondering whether you can get your code to work with NumPy’s BLAS functions.

Currently, I have defined separate classes for each data type, where the number of bits for each field can be customized. I understand that trimming existing data types like float64 is a solution, but I’d rather avoid that because a user may choose a bit width for the exponent that is larger than what the IEEE standard supports.

Can you please elaborate on your last sentence a little? Or share a snippet of code that I can run using my data types?
If you mean something like doing matrix multiplication with NumPy on my data types, I don’t think the current implementation uses BLAS. My NumPy is compiled against OpenBLAS, but matrix multiplication on my types runs on a single core, so I assume NumPy is falling back to its own engine rather than OpenBLAS.

I think my implementation is too simple at the moment: it simply defines basic operations such as __add__, __sub__, etc. on the data types. I’m not familiar with BLAS, but I assume there’s more I have to do to make use of OpenBLAS, MKL, etc.
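As a quick sanity check, NumPy’s np.dot does accept object arrays, but it dispatches to a generic Python-level loop that calls the dunder methods, which would explain why only one core is used and BLAS is never involved. A toy demonstration (hypothetical class, not my real implementation):

```python
import numpy as np

calls = {"mul": 0, "add": 0}

class Traced:
    """Tiny wrapper that counts how often NumPy invokes its operators."""
    def __init__(self, v):
        self.v = v
    def __mul__(self, other):
        calls["mul"] += 1
        return Traced(self.v * other.v)
    def __add__(self, other):
        calls["add"] += 1
        return Traced(self.v + other.v)

a = np.array([[Traced(1.0), Traced(2.0)],
              [Traced(3.0), Traced(4.0)]], dtype=object)
out = np.dot(a, a)
print(calls)  # the multiplies/adds of the 2x2 matmul all went through Python
```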

Yeah, those libraries are usually compiled with 32-bit or 64-bit floating-point types in C and/or Fortran. It is probably nontrivial to define efficient arbitrary-length floating-point types in those languages, especially considering memory alignment.

All the classes I have implemented use __slots__ to define member variables. As a result, the memory consumed by each object is determined at the time of creation and is typically a multiple of 8 bytes. Do you think this will solve the problem?
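To illustrate what I mean (hypothetical class names): __slots__ does fix the per-instance layout, but as far as I understand each slot still holds a pointer to a full Python object rather than a packed bitfield, so the fields are not stored contiguously the way a BLAS kernel would need.

```python
import sys

class WithSlots:
    __slots__ = ("sign", "exponent", "mantissa")
    def __init__(self):
        self.sign, self.exponent, self.mantissa = 0, 0, 0

class WithDict:
    def __init__(self):
        self.sign, self.exponent, self.mantissa = 0, 0, 0

s, d = WithSlots(), WithDict()
print(sys.getsizeof(s))                              # header + pointer-sized slots
print(sys.getsizeof(d) + sys.getsizeof(d.__dict__))  # the per-instance dict adds overhead
# Either way, each field is a pointer to a boxed Python int,
# not a packed run of bits in the object itself.
```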

Besides, what I had in mind was to let users try different representations and arithmetic operations in Python. Later on, they can implement their optimized design on an FPGA.

The goal I’m trying to achieve at this point is to run Python code on GPU to gain a huge speedup.

I just remembered I had seen the PyTorch internals a while ago. This article explains how to add a new data type in CPython and integrate it into PyTorch.

For basic data types such as Float, the THGenerateFloatType.h header file has a line that looks like this:

#define real float

where all instances of real in Tensor.cpp will be replaced by float when the code is automatically generated.

Can I define my own data type using something like this:

typedef struct {
    PyObject_HEAD
    int first;
    int second;
} PyMyObject;

and then create a new header file that has a line that looks like this: