Variations in learning with scale of data sample size?

This is an appeal for recommendations from those with experience…

I have a situation where I was using a small subset of my data set while developing the model, just to make sure everything flowed through and worked as expected. When I switched to the full data set, the training process no longer seems to improve, which I suspect could come from any number of things. I'm trying to narrow down my search to determine whether I am doing something wrong, whether this is expected behavior, or whether what I'm trying to do is so flawed that it should be abandoned.

I am attempting to train a model that returns a ranking between two different samples, so very much a function akin to A > B. Each sample has several hundred features, and I know from my training data what the rank order of the samples is. A coin flip should give 50% accuracy.

During the initial build process, with a truncated data set of about 4000 samples, I created (per epoch) a random batch of 100,000 sample pairs from those 4000. I'm using negative log loss as my training metric and measuring accuracy simply by rounding the final output to either 0 or 1 for comparison against the target. With this approach, training went from 50% accuracy to over 65% accuracy, and I felt like things were working as expected.
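Roughly, the pair sampling and loss/accuracy computation look like the sketch below (PyTorch; `features`, `ranks`, and `model` are placeholders standing in for my actual tensors and network):

    import torch
    import torch.nn as nn

    # Sketch of the pair sampling described above. `features` is an
    # (n_samples, n_features) float tensor, `ranks` an (n_samples,) tensor of
    # target ranks, and `model` any network mapping a concatenated pair to one logit.
    def sample_pair_batch(features, ranks, batch_size=100_000):
        n = features.shape[0]
        i = torch.randint(n, (batch_size,))
        j = torch.randint(n, (batch_size,))
        x = torch.cat([features[i], features[j]], dim=1)   # (batch, 2 * n_features)
        y = (ranks[i] > ranks[j]).float()                  # 1.0 if the first sample outranks the second (ties -> 0.0)
        return x, y

    def train_step(model, optimizer, features, ranks):
        x, y = sample_pair_batch(features, ranks)
        logits = model(x).squeeze(1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, y)  # negative log loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # accuracy by rounding the predicted probability to 0 or 1
        acc = ((torch.sigmoid(logits) > 0.5).float() == y).float().mean()
        return loss.item(), acc.item()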

When I swapped out the truncated data set for the full set, which has over 400 of these ranking periods, learning no longer progressed and accuracy only fluctuated around 50-53%. I expected a slowdown in the learning process, but not such a consistent oscillation.

Is there any advice for where to begin hacking away at this? For example, maybe I should not be using batch normalization with batches of 100,000 samples. Or maybe what I'm trying to do is impossible. I'm basically just stabbing in the dark, changing parameters, learning rates, activation functions, and the number of hidden nodes/layers. I suspect, however, that there is a much better way to go about optimizing these sorts of problems, and I'm hoping that someone with more experience who has encountered this in their own work could recommend some reading or an approach.

For a very simplified example of the type of data I'm dealing with…

          feature_1,  feature_2, feature_n, target_rank
sample_1      0.0,       0.5,       -0.5,    0.1
sample_2      0.0,       0.1,       0.5,     0.01
sample_n      0.0,       0.1,       0.5,     0.2

For training, I then generate a dataloader which creates a new sample, for example by combining the features from sample_1 and sample_2 and generating a new target based on the target_rank of each sample.

       feature_a, feature_b, feature_n, feature_aa, feature_bb, feature_nn, target
sample   0.0,       0.5,       -0.5,         0.0,       0.1,       0.5,     1.0
...
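In code, the pairing looks roughly like this (just a sketch; `features` and `target_rank` would be tensors built from the first table, and the class and parameter names are placeholders):

    import torch
    from torch.utils.data import Dataset, DataLoader

    # Sketch of the pairing step, assuming `features` is an (n_samples, n_features)
    # tensor and `target_rank` an (n_samples,) tensor built from the table above.
    class PairDataset(Dataset):
        def __init__(self, features, target_rank, n_pairs):
            self.features = features
            self.target_rank = target_rank
            self.n_pairs = n_pairs

        def __len__(self):
            return self.n_pairs

        def __getitem__(self, idx):
            # idx is ignored: a fresh random pair is drawn every time
            i, j = torch.randint(len(self.features), (2,))
            x = torch.cat([self.features[i], self.features[j]])      # feature_a..feature_nn
            y = (self.target_rank[i] > self.target_rank[j]).float()  # target: 1.0 if the first sample ranks higher
            return x, y

    # loader = DataLoader(PairDataset(features, target_rank, n_pairs=100_000), batch_size=1024)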

If this issue rings familiar to anyone, I would love some advice on how to understand why the model's learning seems to break down once the data set scales up to this size.

Some thoughts:

  • Overfitting (on a single batch if you want) should work to the point where you get perfect training accuracy (see the sketch after this list).
  • Do you have some sort of baseline model?
  • The problem could have any number of causes, from the task being overly hard to your modelling not working well. A baseline would exclude the first to some extent. For the latter, it is probably reasonable to start simple but large enough to overfit, and then regularize (see Andrej Karpathy’s classic recipe).
  • Given that you want a 0/1 output on pairs of samples, techniques from recommender systems might be very useful. Unfortunately, that is an area where the number of “freely available tutorials” is not as large. For many things, word2vec-style systems (also a large co-occurrence matrix) can help with the intuition.
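To make the first point concrete, a minimal version of the single-batch check could look like the sketch below. The data, model, and hyperparameters are stand-ins; swap in your real pair batch and network:

    import torch
    import torch.nn as nn

    # Stand-in batch and model just to show the check; all shapes and
    # hyperparameters here are placeholders.
    n_features = 300
    x_fixed = torch.randn(1024, 2 * n_features)   # one fixed batch of pairs
    y_fixed = (torch.rand(1024) > 0.5).float()    # fixed 0/1 targets

    model = nn.Sequential(
        nn.Linear(2 * n_features, 128),
        nn.ReLU(),
        nn.Linear(128, 1),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Train on the same batch over and over; if the setup is sound, accuracy on
    # this one batch should approach 1.0.
    for _ in range(2000):
        logits = model(x_fixed).squeeze(1)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, y_fixed)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    acc = ((torch.sigmoid(model(x_fixed).squeeze(1)) > 0.5).float() == y_fixed).float().mean()
    print(f"accuracy on the fixed batch: {acc.item():.3f}")

If that never gets close to perfect accuracy, the issue is in the model/training setup rather than the size of the data set.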

Best regards

Thomas
