GPyTorch performs very poorly when the data has different distributions

Hi everyone, before I delve deeper into my problem, I want to clarify that I'm using a simple RBF kernel and a Gaussian likelihood for the GPyTorch model. We also have an embedding network that transforms a data triplet (explained below) into a latent representation, which is then fed into the Gaussian process. For simplicity, my current dataset consists of discounts, where each discount was offered to a customer (identified by a CustomerID) who bought a product from a specific product group (PG). Each PG contains the discounts offered to customers who bought its products. The Gaussian process is used to predict multiple features from a CustomerID and PG combination, such as the discount. To achieve this, I feed the embedding network a triplet consisting of the CustomerID, the PG, and an index for the feature to be predicted; for example, feature index 18 corresponds to the discount. So, when I give the GP the triplet (2102, XO, 18), it should predict the optimal discount for customer 2102 buying a product from product group XO.

We have now run into a problem with GPyTorch. Although the model performed well with GPflow (which is strange), the performance with GPyTorch was poor. GPyTorch performs well when the discount distributions of the PGs are very similar. For instance, on a subset of PGs whose discounts are distributed as a mixture of two Gaussians with similar means and variances, GPyTorch performs well. But as soon as we include PGs whose discount distributions consist of only a single Gaussian, have strongly varying means or variances, or look more like log-gamma distributions, the Gaussian process performs very poorly.

GPyTorch's Gaussian process struggles to handle these different distribution types. We observed much better results when we trained an individual model for each PG based on its distribution type. However, we face the challenge of combining all PGs into one dataset while still achieving good results. How can we tune GPyTorch to handle this more complex task? Are there specific techniques or parameters we should focus on? I would greatly appreciate any ideas or suggestions.

Regards, Kai