Bayesian regression for categorical respons variable

Matthijs · April 29, 2023, 1:59pm

Hi all,

I just started out using pytorch so bear with me. I have a dataset of 60000 explanatory variables and 324 categorical response variables. I want to use feature selection on this dataset to see which variables explain this dataset the best and use it for other algorithms.

My priors and likelihood are very simple. These will not be the final distributions, but they work for this example.

As of right now, I use Bayesian linear regression to predict the weights of each of the explanatory variables and see what the 100 most influential dependent variables are by checking the intercept of the weights.

However, since I don’t know how to properly use/predict categorical response variables, I put the variable name in a dictionary and change the variable to its indices. This is not the correct way to do it, but I do now have the knowledge to do so. How do I properly predict which variable is most influencial for categorical response variables and what kind of model should I use?

# Load data
data = pd.read_csv('gene.csv')

# extract the labels
labels = data.columns[1:]
label_string = [label for label in labels.values]
dictionary = {k: v for k, v in enumerate(label_string)}

# map the labels to their corresponding indices
data.iloc[:, 0] = range(len(data))

y = torch.tensor(data.iloc[:, 0].values).double()
X = torch.tensor(data.iloc[:, 1:].values).double()

# Define the model
def create_model(X, y):
    num_features = X.shape[1]
    # Priors on model parameters
    w = pyro.sample('w', dist.Normal(torch.zeros(num_features), 1)).double()
    b = pyro.sample('b', dist.Normal(0, 10)).double()
    # Linear regression
    mu = torch.matmul(X, w) + b
    # Likelihood of observed data
    pyro.sample('y', dist.Normal(mu, 1), obs=y)

nuts_kernel = NUTS(create_model)
mcmc = MCMC(nuts_kernel, num_samples=1000, warmup_steps=500)
mcmc.run(X, y)
posterior_samples = mcmc.get_samples()

# Extract the means and standard deviations of the posterior distributions of the model parameters
w_mean = posterior_samples['w'].mean(axis=0)
w_std = posterior_samples['w'].std(axis=0)
b_mean = posterior_samples['b'].mean()
b_std = posterior_samples['b'].std()

# get 100 most influencial weights

# Get mean absolute weight for each feature 
mean_abs_weights = abs(posterior_samples['w']).mean(axis=0)


# Sort features based on mean absolute weight
sorted_features = np.argsort(mean_abs_weights)
sorted_features_flipped = torch.flip(sorted_features, [0])

# Get indices of top 100 features
top_100_features = sorted_features_flipped[:100]

#print incides top 100 features
print(top_100_features)