Does nn.RNN use the same dropout mask for every timestep? If not, how can I make it work that way while still matching the performance of nn.RNN?
This type of dropout performs better according to the following paper, and it is also the form of dropout used in Keras.
No, it doesn’t: PyTorch’s RNNs are thin wrappers around cuDNN, which doesn’t support time-locked dropout masks. You can implement it yourself, though at the cost of reduced speed, since a custom dropout can’t use the optimized cuDNN kernel.
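A minimal sketch of what such a "locked" dropout module could look like (the class name `LockedDropout` and the assumed input layout `(seq_len, batch, features)`, matching nn.RNN's default, are my own choices, not part of PyTorch):

```python
import torch

class LockedDropout(torch.nn.Module):
    """Time-locked (variational) dropout: sample one Bernoulli mask per
    sequence and reuse it at every timestep, instead of resampling per step."""

    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        # x is assumed to have shape (seq_len, batch, features)
        if not self.training or self.p == 0.0:
            return x
        # Sample a mask of shape (1, batch, features); broadcasting over
        # the first dimension applies the same mask at every timestep.
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - self.p)
        # Inverted-dropout scaling so expected activations are unchanged.
        return x * mask / (1 - self.p)
```

You would apply this to the RNN's input (and/or between stacked RNN layers constructed as separate single-layer nn.RNN modules), rather than passing `dropout=` to nn.RNN itself.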