Encoding arbitrarily nested, structured data with trees of MHSA + pooling blocks

I have spent the last several years wrangling TensorDict and @tensorclass into an extensible framework that can encode arbitrarily nested (structured) JSON-like objects for predictive modeling tasks.

Basically, it is a tree of transformer encoder and pooling modules that attend to structured inputs of any data type: nullable numbers, categorical data, date types, hashable objects like device IDs. Anything goes. Want more data types? You can build custom ones (examples are included in the docs). The framework also includes some internal plumbing for managing shared state, so categorical features can learn new vocabulary during training.
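To make the tree idea concrete, here is a minimal sketch in plain PyTorch. It is not the framework's actual API: `NumberLeaf`, `CategoryLeaf`, `NodeEncoder`, and the toy transaction schema are all hypothetical names chosen for illustration. Each node runs self-attention over its children's embeddings and mean-pools the result into one vector, which then acts as a single token for the parent node.

```python
import torch
from torch import nn


class NumberLeaf(nn.Module):
    """Hypothetical leaf: embeds a nullable number as (value, is_null)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) floats, with NaN marking a missing value
        is_null = torch.isnan(x)
        value = torch.where(is_null, torch.zeros_like(x), x)
        feats = torch.stack([value, is_null.float()], dim=-1)
        return self.proj(feats)  # (batch, d_model)


class CategoryLeaf(nn.Module):
    """Hypothetical leaf: embeds integer category ids."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.emb(x)  # (batch, d_model)


class NodeEncoder(nn.Module):
    """Hypothetical node: attention over child embeddings, then mean pooling."""

    def __init__(self, fields: dict, d_model: int, n_heads: int = 4):
        super().__init__()
        self.fields = nn.ModuleDict(fields)
        self.attn = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=2 * d_model,
            dropout=0.0, batch_first=True,
        )

    def forward(self, record: dict) -> torch.Tensor:
        # One token per child; a child may itself be a NodeEncoder.
        tokens = torch.stack(
            [enc(record[name]) for name, enc in self.fields.items()], dim=1
        )  # (batch, n_fields, d_model)
        return self.attn(tokens).mean(dim=1)  # (batch, d_model)


d = 16
payer = NodeEncoder({"age": NumberLeaf(d), "country": CategoryLeaf(10, d)}, d)
root = NodeEncoder(
    {"amount": NumberLeaf(d), "currency": CategoryLeaf(5, d), "payer": payer}, d
)

batch = {
    "amount": torch.tensor([12.5, float("nan")]),  # second amount is null
    "currency": torch.tensor([0, 3]),
    "payer": {"age": torch.tensor([31.0, 47.0]), "country": torch.tensor([2, 7])},
}
embedding = root(batch)  # one d-dimensional vector per record
```

Because every node emits a fixed-size vector, nesting depth does not change the interface: the root embedding can be passed to any downstream head regardless of how deep the schema goes.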

This is designed to operate at scale and stream from billions of nested data structures with zero feature engineering. The framework includes custom data plumbing with support for new data loaders, plus custom preprocessing and postprocessing, so no batch feature engineering or data wrangling is required.
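As a rough picture of what "zero feature engineering" can mean in practice, the sketch below walks a nested record and yields one typed leaf per value, so each leaf can be routed to an encoder for its type. The names `type_tag` and `iter_leaves`, the tag strings, and the sample record are assumptions made for this example, not the framework's preprocessing API.

```python
from datetime import datetime
from typing import Any, Iterator, Tuple


def type_tag(value: Any) -> str:
    """Map a raw Python value to a coarse (hypothetical) data-type tag."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, datetime):
        return "datetime"
    return "category"  # strings and other hashable ids


def iter_leaves(obj: Any, path: Tuple = ()) -> Iterator[Tuple[Tuple, str, Any]]:
    """Depth-first walk yielding (path, type tag, value) for every leaf."""
    if isinstance(obj, dict):
        for key, child in obj.items():
            yield from iter_leaves(child, path + (key,))
    elif isinstance(obj, list):
        for index, child in enumerate(obj):
            yield from iter_leaves(child, path + (index,))
    else:
        yield path, type_tag(obj), obj


record = {
    "amount": 12.5,
    "device_id": "a1b2",
    "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": None}],
    "created": datetime(2024, 1, 1),
}
leaves = list(iter_leaves(record))
```

Since `iter_leaves` is a generator, the same walk works on a stream of records without materializing a flat feature table first.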

Built-in decision heads for multi-target, multi-label, and multi-task classification and/or regression problems, as well as general-purpose embedding.
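For readers unfamiliar with multi-task heads, here is a minimal PyTorch sketch of the general pattern: several small heads share one pooled embedding, and their losses are combined. `MultiTaskHeads`, `multi_task_loss`, the output keys, and the unweighted sum of losses are all assumptions for illustration; the framework's own heads may be structured differently.

```python
import torch
from torch import nn
import torch.nn.functional as F


class MultiTaskHeads(nn.Module):
    """Hypothetical heads sharing one embedding: classify, multi-label, regress."""

    def __init__(self, d_model: int, n_classes: int, n_labels: int):
        super().__init__()
        self.classify = nn.Linear(d_model, n_classes)  # single-label classes
        self.tag = nn.Linear(d_model, n_labels)        # independent binary labels
        self.regress = nn.Linear(d_model, 1)           # one scalar target

    def forward(self, embedding: torch.Tensor) -> dict:
        return {
            "class_logits": self.classify(embedding),
            "label_logits": self.tag(embedding),
            "value": self.regress(embedding).squeeze(-1),
        }


def multi_task_loss(outputs: dict, targets: dict) -> torch.Tensor:
    # Assumption: a plain unweighted sum; real setups often weight each task.
    return (
        F.cross_entropy(outputs["class_logits"], targets["class"])
        + F.binary_cross_entropy_with_logits(outputs["label_logits"], targets["labels"])
        + F.mse_loss(outputs["value"], targets["value"])
    )


heads = MultiTaskHeads(d_model=16, n_classes=3, n_labels=5)
embedding = torch.randn(4, 16)  # stand-in for pooled encoder output
targets = {
    "class": torch.tensor([0, 2, 1, 0]),
    "labels": torch.randint(0, 2, (4, 5)).float(),
    "value": torch.randn(4),
}
outputs = heads(embedding)
loss = multi_task_loss(outputs, targets)
loss.backward()  # gradients flow into all three heads
```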

Built-in real-time model deployment and batch inference, both with feature parity (so no lift-and-shift of feature engineering). 100k+ transactions per second of throughput on a single A10G…

Chess engines, recommendation systems, distilling BERT, monitoring financial transactions for fraud … the system is pretty universal in applicability.

All open source - I would greatly appreciate any insights and recommendations for future improvements!
