Understanding why ensembling only improves marginally

Question

I have two models, A and B, trained on Imagenet. Their accuracies on Imagenet validation set are 35.6% and 28.64% respectively, while the accuracy of their ensemble (averaging their scores) is 35.68%. I am interested in finding out why the ensembling isn't effective here.

Specifically, I was going to inspect the confusion matrices for each model, but Imagenet has more than 1000 classes which makes this intractable. Another thing that was suggested to me is Mutual Information, but I can't figure out how to apply it in this context.

So, I have a two part question:

Why doesn't the accuracy of the ensemble degrade (to the average of two accuracies) or improve?
Is there a way of visualizing/scoring the output of the two networks to measure correlation?

Edit 1: Both are AlexNet models, but were trained with two different pre-trained weight initializations. The pre-trained weights themselves come from two different self-supervised tasks. Also, when these models were trained (initialized with respective pre-trained weights) on Pascal, there's is a significant boost in the accuracy. Thus, my quest to figure out how do you measure the correlation between models that are ensembling.

So it is working then. It's not causing any harm. There is no guaranty at all that there would be a larger improvement. — Isbister, 15 hours ago
Yes, I understand that, but I am more interested in why the accuracy only improved marginally. Had the accuracy improved significantly I know what's happening. Same for reduction in accuracy. I am looking for an alternative to confusion matrix. Something that lets me see what's happening. — Ajinkya, 15 hours ago
How identical are the models? Do they have the same features, same structure? If they have learnt different things then there could be a higher increase because they complement each other. Otherwise it might just be information overlap. Have a look at their shap values github.com/slundberg/shap ? — Isbister, 14 hours ago

Simon · Accepted Answer · 2019-02-07 20:57:44Z

From what you are saying on can make up that you predict scores for each class and that the highest score is your class prediction.

A question you may ask is whether or not the class prediction is affected by the ensembling at all. For the sake of the example consider an example where Model A predicts a class always with a normalised score of 1.0 (i.e. full confidence, recommended reading). Suppose Model A is right 80% of the time. Model B however, is less certain. It places his bets on the top 5 classes that it deems likely with an equal normalised score of 0.2. Averaging the scores results in the same top predicted class as for Model A alone. You could easily check this by calculating the correlations between the predictions of the models with the ensembled predictions.

asked	today
viewed	34 times
active	today

Stack Exchange Network

current community

your communities

more stack exchange communities

Understanding why ensembling only improves marginally

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged machine-learning ensemble-modeling or ask your own question.

Hot Network Questions

Understanding why ensembling only improves marginally

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged machine-learning ensemble-modeling or ask your own question.

Related

Hot Network Questions