Probability regression models

I have 2-D data points, each with an associated probability (of the order of 10e-8). I also have a functional form for the probability distribution, with a complicated parameterisation of the mean and covariance matrix (14 parameters in total). I want to find the parameters for which the model best fits these observed probabilities. I am using PyTorch for this.

Total number of data points: 6000
Ranges of the 2-D data points:
  • column 1: 0 to 6
  • column 2: 0 to 5

I am currently using mean squared error (which is probably wrong). The loss is very large at initialisation; it decreases over the epochs but converges to a value that is still very large, e.g. from 10e12 to 10e8.
I trained for 1000 epochs without much improvement. I also tried a KL divergence loss, but there the loss fluctuated instead of converging. I am using the Adam optimiser.
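
A minimal sketch of the kind of fit described above, under stated assumptions: a plain 2-D Gaussian (5 free parameters) stands in for the actual 14-parameter form, the data are synthetic and only match the stated size and column ranges, and the target values come from a made-up "true" Gaussian rather than the real observed probabilities. The names log_density, raw_chol, true_mean and true_chol, the starting values and the learning rate are all hypothetical.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in data: 6000 points, column 1 in [0, 6], column 2 in [0, 5].
n_points = 6000
x = torch.rand(n_points, 2) * torch.tensor([6.0, 5.0])


def log_density(points, mean, raw_chol):
    """Log-density of a 2-D Gaussian whose covariance is L @ L.T."""
    # Strictly lower part is free; exponentiating the diagonal keeps it positive,
    # so the covariance stays valid throughout optimisation.
    strict_lower = torch.tril(raw_chol, diagonal=-1)
    positive_diag = torch.diag(torch.exp(torch.diagonal(raw_chol)))
    dist = torch.distributions.MultivariateNormal(
        mean, scale_tril=strict_lower + positive_diag
    )
    return dist.log_prob(points)


# Made-up target probabilities (assumption: NOT the real observations).
true_mean = torch.tensor([3.0, 2.5])
true_chol = torch.tensor([[1.5, 0.0], [0.5, 1.2]])
target = torch.distributions.MultivariateNormal(
    true_mean, scale_tril=true_chol
).log_prob(x).exp()

# Trainable parameters with an arbitrary starting point.
mean = torch.tensor([2.0, 2.0], requires_grad=True)
raw_chol = torch.zeros(2, 2, requires_grad=True)

optimizer = torch.optim.Adam([mean, raw_chol], lr=0.05)
losses = []
for step in range(200):
    optimizer.zero_grad()
    predicted = log_density(x, mean, raw_chol).exp()
    loss = torch.mean((predicted - target) ** 2)  # plain MSE on probabilities
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3e}, last loss: {losses[-1]:.3e}")
```

The covariance is parameterised through a Cholesky factor here only so that it remains positive definite during training; the actual parameterisation in the problem is different and more complicated.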

I have a couple of questions:

  • How should I initialise my parameters?
  • What is an appropriate loss function for such probability regression problems?
  • Would SGD be a good optimisation algorithm for this task?
  • What other optimisation algorithms work well for such tasks?
  • Are there standard models (and literature pointers) for such probability regression problems?

I am new to this problem. I have searched a lot of resources but did not find a satisfactory answer, and I have been stuck for many days now.

Thanks in advance.