Suboptimal convergence compared with the TensorFlow model

It was this part of the docs that made me think it could lead to a noticeable difference:

TensorFlow Adam: “The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used).”

Since I see a lot of people training embedding models in PyTorch and comparing them against TensorFlow, I bet many of these performance differences stem from this: TensorFlow's Adam automatically applies the momentum decay to every slice of the accumulator, whereas PyTorch's SparseAdam by default does not. A minimal sketch of the PyTorch side follows.

Anyway, I have always been able to get performance just as good as or better than TensorFlow. I do use custom implementations most of the time, but the underlying framework has shown no insufficiency in performance for me; I usually find quite the opposite.