How can I efficiently compute the gradient for each training sample?

It should work for those architectures as well. What's missing is support for other layers with trainable parameters (e.g. the `MultiheadAttention` layer).
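For reference, one common way to get per-sample gradients in recent PyTorch (≥ 2.0) is the `torch.func` API, which composes `grad` with `vmap` over the batch dimension. The sketch below uses a toy `nn.Linear` model and random data purely for illustration; the same pattern applies to any module whose forward can be expressed through `functional_call`:

```python
import torch
from torch.func import functional_call, grad, vmap

# Toy model and data, for illustration only
model = torch.nn.Linear(4, 2)
loss_fn = torch.nn.CrossEntropyLoss()
data = torch.randn(8, 4)          # batch of 8 samples
targets = torch.randint(0, 2, (8,))

# Detach parameters into a plain dict so functional_call can use them
params = {k: v.detach() for k, v in model.named_parameters()}

def compute_loss(params, sample, target):
    # Re-add a batch dimension of 1, since the model expects batched input
    batch = sample.unsqueeze(0)
    tgt = target.unsqueeze(0)
    preds = functional_call(model, params, (batch,))
    return loss_fn(preds, tgt)

# grad differentiates w.r.t. params; vmap maps over the batch dimension,
# yielding one gradient per sample instead of a summed batch gradient
per_sample_grads = vmap(grad(compute_loss), in_dims=(None, 0, 0))(
    params, data, targets
)

# per_sample_grads["weight"] has shape (8, 2, 4): one (2, 4) gradient per sample
```

This avoids the classic workaround of looping over the batch and calling `backward()` once per sample, and it works for any layer that `vmap` can trace through.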