Ok, I figured it out by looking at the source code. To anyone who wants to understand the weights and calculations in the multi-head attention, here is a simple gist
Ok, I figured it out by looking at the source code. To anyone who wants to understand the weights and calculations in the multi-head attention, here is a simple gist