Typically, GMMs are trained with expectation-maximization, because of the need to enforce the unitary constraint over the categorical variable (the mixture weights must be non-negative and sum to one).

However, in PyTorch it is possible to get a differentiable log probability from a GMM. Why is this possible? How exactly is the constraint implemented in the code?
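To make the question concrete, here is a minimal sketch (using the standard `torch.distributions` API) of what I mean by a differentiable log probability: gradients flow back to both the mixture weights and the component parameters.

```python
import torch
from torch.distributions import Categorical, Normal, MixtureSameFamily

# Unconstrained parameters for a 3-component 1-D GMM
logits = torch.zeros(3, requires_grad=True)
means = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
stds = torch.ones(3, requires_grad=True)

gmm = MixtureSameFamily(Categorical(logits=logits), Normal(means, stds))
lp = gmm.log_prob(torch.tensor(0.5))  # log p(x) under the mixture
lp.backward()
# logits.grad, means.grad, stds.grad are all populated
```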

Do you mean MixtureSameFamily? That one contains a nested Categorical distribution object whose sampling is non-differentiable: gradients do not flow through its samples back to the distribution parameters.

It is possible (though not trivial) to train Categorical with sampling: the docs describe the REINFORCE / score-function estimator. This issue is orthogonal to using gradient descent to train GMMs in general, or to the "unitary constraint" (that is handled internally by the Categorical distribution class, which normalizes its unconstrained logits with a softmax).
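A short sketch of both points, assuming the standard `torch.distributions.Categorical` API: the constraint is satisfied for any real-valued logits because `probs` is a softmax of them, and while `sample()` is non-differentiable, `log_prob(sample)` is, which is exactly what the score-function (REINFORCE) estimator differentiates.

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([0.2, -1.3, 0.7], requires_grad=True)
cat = Categorical(logits=logits)

# Constraint handled internally: probs = softmax(logits), sums to 1
probs = cat.probs

# Sampling itself is non-differentiable, but the log-probability of a
# drawn sample is differentiable in the logits (score-function trick):
sample = cat.sample()
score = cat.log_prob(sample)
score.backward()
# logits.grad now holds d log p(sample) / d logits
```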

Ah, right, direct MLE via log_prob (estimating all parameters simultaneously by gradient descent) works; it is just not very robust with random initializations and SGD, since it tends to get stuck in poor local optima. It may be fine for some tasks.
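For reference, a minimal sketch of that direct-MLE approach on toy data (the log-std parameterization is my own choice here, just to keep the scales positive without an explicit constraint):

```python
import torch
from torch.distributions import Categorical, Normal, MixtureSameFamily

torch.manual_seed(0)
# Toy data: two well-separated clusters at -2 and +2
data = torch.cat([torch.randn(200) - 2.0, torch.randn(200) + 2.0])

# Unconstrained parameters; stds live in log space to stay positive
logits = torch.zeros(2, requires_grad=True)
means = torch.tensor([-1.0, 1.0], requires_grad=True)
log_stds = torch.zeros(2, requires_grad=True)

opt = torch.optim.Adam([logits, means, log_stds], lr=0.05)
for _ in range(500):
    gmm = MixtureSameFamily(Categorical(logits=logits),
                            Normal(means, log_stds.exp()))
    loss = -gmm.log_prob(data).mean()  # negative log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()
# means should have moved toward the true cluster centers
```

On an easy problem like this it converges fine; the caveat in the post above is that with unlucky initialization one component can collapse or grab both clusters, which EM's responsibility updates handle more gracefully.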