Committee machines and/or random forests question

I stumbled upon this paper:

The problem I have is with understanding how committee machines work. The only relevant info I find in this article is this:

“The neural networks are executed simultaneously for the given input data and their outputs are evaluated and combined to produce the final committee output to obtain better generalization and performance. The output combination module was often performed based on simple functions on the outputs of individual members in the committee machine, such as majority voting for classification and simple/weighted averaging for regression, without involving the input vectors of attributes”

So what does that mean, exactly? Do I have 10 different neural networks, compute their results, average them, and use that average in error estimation and then do backprop? Or do I simply train 10 different neural networks, then run my test sample through all of them and average the results?

Based on the sentence before the snippet you posted:

In the neural-network committee machine approach, n neural network are trained for solving the same problem independently.

I assume the latter is right. It seems to be basically an ensemble of models for the final classification.
Regarding the method, it seems they just use majority voting without any learnable weights.

Given this is regression, not classification, by “majority voting” are you saying they simply average the outputs of all 10 neural nets?

In classification, majority voting is just the most frequent prediction.
For regression, they just average the predictions.
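A minimal sketch of both combination rules, with made-up predictions (not taken from the paper):

```python
import numpy as np

# Hypothetical class predictions from 3 trained classifiers for 5 samples
class_preds = np.array([
    [0, 1, 1, 2, 0],
    [0, 1, 2, 2, 0],
    [1, 1, 1, 2, 0],
])

# Majority voting: the most frequent class per sample (column-wise mode)
votes = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(votes)  # -> [0 1 1 2 0]

# Hypothetical regression outputs from 3 networks (2 output dimensions)
reg_preds = np.array([
    [2.1, 0.9],
    [2.3, 1.1],
    [1.9, 1.0],
])

# Simple (unweighted) averaging for regression
mean_pred = reg_preds.mean(axis=0)
print(mean_pred)
```

Weighted averaging would just replace `mean` with a weighted sum over the model axis.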

A more general version would be boosting. This can be done for almost any type of classifier or regressor, with fixed (as in your case) or learnable weights.

Average then. Thank you.
I’ll try training different NNs and see if I get results they do in the paper.

Okay, so it’s been a while and I trained an ensemble. Indeed, training more than one NN and averaging the results yields more accurate results. Usually… The error of a single NN in my case is (in Cartesian coordinates) about 2.2 cm, whereas with an ensemble it goes down to 1.5 cm. The authors of the paper I mentioned report 10x better accuracy using 6 NNs, which is definitely not the case for me, and I wonder why that is. I trained my NNs with far more epochs and used the same NN architecture with the same hyperparameters, but a different seed for the random number generator was provided for each NN. Is doing it this way much different from using different hyperparameters for each NN, as the authors did and showed in Table 3?

One more thing that bugs me in the article is this sentence:

“In Eq. (10), the euclidian distance equation has been given. This equation has been used to calculate the distance between end effector and the target known as end effector error. The selection of the best result among neural-network results in the committee machine has been done using this equation.”
I don’t understand what “selection of the best result” means. During inference we can hardly tell which of those NNs’ outputs is closest to the true target, so we can’t really “compare” the results between NNs.

Using the same hyperparameters for the models in an ensemble is usually not the best idea.
This might lead to correlated outputs of the models and thus a low model variety.

I think this toy example was used in Stanford’s CS231n, but I cannot find the source anymore.
We compare an ensemble of good but correlated and worse but uncorrelated models:

target - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

case 1:
modelA - 1, 1, 1, 1, 1, 1, 1, 1, 0, 0 - 0.8
modelB - 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 - 0.9
modelC - 1, 1, 1, 1, 1, 1, 1, 1, 0, 0 - 0.8
majority - 1, 1, 1, 1, 1, 1, 1, 1, 0, 0 - 0.8

case 2:
modelA - 1, 0, 1, 1, 1, 1, 1, 0, 0, 1 - 0.7
modelB - 0, 1, 1, 1, 1, 1, 0, 1, 0, 1 - 0.7
modelC - 1, 1, 1, 0, 1, 1, 1, 1, 1, 0 - 0.8
majority - 1, 1, 1, 1, 1, 1, 1, 1, 0, 1 - 0.9

This is of course only a toy example, but even though we ensemble worse models in case 2, it wins: the benefit comes from uncorrelated outputs, i.e. high model variety.
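The same toy example can be checked in code (a small sketch; the prediction vectors are copied from the table above):

```python
import numpy as np

target = np.ones(10, dtype=int)

# Case 1: good but correlated models (accuracies 0.8, 0.9, 0.8)
case1 = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
])

# Case 2: worse but uncorrelated models (accuracies 0.7, 0.7, 0.8)
case2 = np.array([
    [1, 0, 1, 1, 1, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 0],
])

def majority(preds):
    # Element-wise majority vote across models (binary labels):
    # a sample is 1 if more than half of the models predict 1
    return (preds.sum(axis=0) >= (len(preds) + 1) // 2).astype(int)

accs = {}
for name, preds in [("case 1", case1), ("case 2", case2)]:
    accs[name] = (majority(preds) == target).mean()
    print(name, accs[name])  # case 1 -> 0.8, case 2 -> 0.9
```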

I see.

I have now run 6 different trainings with different random seeds, numbers of neurons in the hidden layer, and learning rates. The training has barely started and the error is not that big, but when I compute the ensemble error, the result is often worse than that of the one or two currently best nets. I’ll wait till morning for the final results, but at the moment the ensemble doesn’t look promising.

Also note that for the ensemble method the training set is usually split into N random subsets when training N models. Thus every model is trained on a different subset (they may overlap). This leads to every model learning slightly different features, becoming an “expert” on its own type of data (they usually tend to overfit). Afterwards, the negative effects of overfitting in a single model should be canceled out by running the ensemble classifier.
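A minimal sketch of drawing such overlapping subsets (bootstrap-style sampling with replacement; the sizes and the number of models here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_models = 100, 5

# Each model gets its own random sample of the training indices,
# drawn with replacement, so the subsets may overlap
subsets = [rng.choice(n_samples, size=n_samples, replace=True)
           for _ in range(n_models)]

# With replacement, each model typically sees only ~63% of the
# distinct training samples, so the models diverge from each other
for i, idx in enumerate(subsets):
    print(f"model {i}: {len(np.unique(idx))} unique samples of {n_samples}")
```

Each model would then be trained only on `X[idx]`, `y[idx]` for its own index set.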


Okay, so I’ve tried all your suggestions: using different subsets of the training set for each NN and using different hyperparameters for each NN. None of these produces networks that, combined, give me more than a 20-30% improvement in accuracy.

Instead of relying on the committee, I trained one more network, which also has 24 neurons per layer but has two hidden layers instead of one. That network managed to get 10x better accuracy than each of the 1-hidden-layer networks. Nevertheless, I would be happy to figure out how to get the committee working.

I wonder what the authors mean by “selection of the best result among neural-network results in the committee machine”. Since they train 10 NNs and say that only 6 are needed, I would guess this “selection” refers to picking those 6 NNs out of the 10.
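One way that reading could look in code (just a guess sketched under assumptions, with made-up numbers standing in for real validation data): if the selection happens on a validation set where the targets are known, the per-network Euclidean end-effector error from Eq. (10) can be computed and the best 6 of the 10 kept for the committee:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 10 trained networks, a validation set of 50 known
# 3D end-effector target positions; preds[i] are network i's predictions
targets = rng.normal(size=(50, 3))
preds = targets[None, :, :] + rng.normal(scale=0.1, size=(10, 50, 3))

# Mean Euclidean distance (end-effector error) per network
errors = np.linalg.norm(preds - targets, axis=2).mean(axis=1)

# Keep the 6 networks with the smallest validation error,
# then average only their predictions in the committee
best6 = np.argsort(errors)[:6]
committee_pred = preds[best6].mean(axis=0)
print(best6, errors[best6])
```

This only works where the targets are known (validation), not at inference time, which would fit the selection-of-members interpretation.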

A 20-30% increase in accuracy sounds like a big improvement for an ensemble.
Usually you will get an increase of approx. 1-2%.

If another architecture works better, I would go for it and maybe try an ensemble at the end of your experiments.

The authors of the paper report a 10x increase in accuracy. The article is not clear to me in many respects, though; they didn’t even say explicitly what kind of error function they are optimizing. I wonder whether they simply trained dozens of NNs and picked the best six; that’s what “selection of the best result among neural-network results in the committee machine” could mean.