Training a neural network for multi-label data (50-dimensional binary vectors)


This is my problem setup:
Training input: 6300 x 300. These are standard BERT embeddings, i.e. floating-point values, mostly negative.
Training output: 6300 x 50. These are binary bit arrays such as [0, 0, 1, 1, 0, … 0].
Validation set: 800 samples.

I want to train a neural network (with two hidden layers) that maps the training inputs to the training outputs. I have tried BCELoss and experimented with the learning rate, weight decay, dropout probability, and batch size. I also increased the hidden layer size to 1000 and changed the number of layers. However, the validation loss either does not decrease or increases, and the training loss stops decreasing after a few epochs.
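For concreteness, the kind of setup described above might look roughly like the following sketch. It assumes PyTorch (suggested by the mention of BCELoss) and uses random tensors as stand-ins for the real embeddings and labels. The class name `MultiLabelMLP`, the hidden size, dropout rate, optimizer settings, batch size, and the 10% positive rate are all illustrative assumptions, not the actual configuration. It also uses `nn.BCEWithLogitsLoss` (sigmoid and binary cross-entropy in one step) in place of a separate sigmoid followed by `nn.BCELoss`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-ins with the shapes from the question (not real BERT data).
X_train = torch.randn(6300, 300)
Y_train = (torch.rand(6300, 50) < 0.1).float()  # assumed ~10% positive bits
X_val = torch.randn(800, 300)
Y_val = (torch.rand(800, 50) < 0.1).float()


class MultiLabelMLP(nn.Module):
    """Two hidden layers, one output logit per label (hypothetical sizes)."""

    def __init__(self, in_dim=300, hidden=256, n_labels=50, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_labels),  # raw logits, no sigmoid here
        )

    def forward(self, x):
        return self.net(x)


model = MultiLabelMLP()
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

history = []  # (train_loss, val_loss) per epoch
for epoch in range(3):
    model.train()
    perm = torch.randperm(X_train.size(0))
    for i in range(0, X_train.size(0), 64):  # mini-batches of 64
        idx = perm[i:i + 64]
        optimizer.zero_grad()
        loss = criterion(model(X_train[idx]), Y_train[idx])
        loss.backward()
        optimizer.step()

    # Evaluate with dropout disabled and without tracking gradients.
    model.eval()
    with torch.no_grad():
        train_loss = criterion(model(X_train), Y_train).item()
        val_loss = criterion(model(X_val), Y_val).item()
    history.append((train_loss, val_loss))
    print(f"epoch {epoch}: train {train_loss:.4f}  val {val_loss:.4f}")
```

Recording both losses per epoch, as `history` does here, makes it easier to tell apart the two symptoms mentioned above: a training loss that plateaus and a validation loss that flattens or rises.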

I think my challenge is that I have too many labels. I am wary of increasing the hidden layer size too much, since that would increase the network's complexity. Eventually I will try to scale my model to 100k samples.
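One way to check whether the number of labels is really the issue is to look at how often each of the 50 bits is actually set. The sketch below is plain Python on a toy label matrix; the helper name `label_stats` and the toy data are hypothetical and only illustrate the calculation.

```python
def label_stats(Y):
    """Return per-label positive rate and negative/positive ratio.

    Y is a list of equal-length rows of 0/1 values (one row per sample).
    """
    n_samples = len(Y)
    n_labels = len(Y[0])
    pos_counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    pos_rate = [c / n_samples for c in pos_counts]
    # None marks labels that never occur, where the ratio is undefined.
    neg_pos_ratio = [(n_samples - c) / c if c > 0 else None for c in pos_counts]
    return pos_rate, neg_pos_ratio


# Toy example with 4 samples and 3 labels (hypothetical data).
Y_toy = [
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
rates, ratios = label_stats(Y_toy)
print(rates)   # [0.0, 0.75, 0.25]
print(ratios)
```

If most labels turn out to be rare, the negative/positive ratios could be passed as the `pos_weight` argument of PyTorch's `nn.BCEWithLogitsLoss`, after choosing some finite value for labels that never occur. Whether that helps here is an open question, not something the setup above confirms.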

Could you please help me design a neural network for this problem?