It should work for those architectures as well. What’s missing is support for other layers with trainable parameters (e.g. the MultiheadAttention layer).
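If you want to check whether a given model contains such layers, something like the sketch below might help. It assumes PyTorch, and the `SUPPORTED` tuple is a hypothetical placeholder rather than the actual list of handled layer types, so swap in whatever is really supported. It walks the model and reports every module that directly owns trainable parameters but is not one of the supported types.

```python
import torch.nn as nn

# Hypothetical placeholder: the layer types assumed to be handled already.
SUPPORTED = (nn.Linear, nn.Conv2d)


def unsupported_trainable_layers(model: nn.Module) -> list:
    """List modules that directly own trainable parameters but are not supported."""
    found = []
    for name, module in model.named_modules():
        # recurse=False: only parameters registered on this module itself,
        # not those belonging to its children.
        own_params = list(module.parameters(recurse=False))
        has_trainable = any(p.requires_grad for p in own_params)
        if has_trainable and not isinstance(module, SUPPORTED):
            found.append(f"{name or '<root>'}: {type(module).__name__}")
    return found


class Tiny(nn.Module):
    """Toy model used only for illustration."""

    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=16, num_heads=4)
        self.fc = nn.Linear(16, 2)


print(unsupported_trainable_layers(Tiny()))
```

Here the attention layer shows up because `nn.MultiheadAttention` keeps its input projection weights as parameters on the module itself, while its output projection is a `Linear` subclass and so passes the check.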