
I have two models, A and B, trained on ImageNet. Their accuracies on the ImageNet validation set are 35.6% and 28.64% respectively, while the accuracy of their ensemble (averaging their scores) is 35.68%. I am interested in finding out why the ensembling isn't effective here.

Specifically, I was going to inspect the confusion matrices for each model, but ImageNet has 1000 classes, which makes visually inspecting a 1000×1000 matrix intractable. Another thing that was suggested to me is mutual information, but I can't figure out how to apply it in this context.

So, I have a two-part question:

  1. Why doesn't the accuracy of the ensemble either degrade (toward the average of the two accuracies) or improve substantially?
  2. Is there a way of visualizing or scoring the outputs of the two networks to measure their correlation?

Edit 1: Both are AlexNet models, but they were trained from two different pre-trained weight initializations. The pre-trained weights themselves come from two different self-supervised tasks. Also, when these models (initialized with their respective pre-trained weights) were trained on PASCAL VOC, there was a significant boost in accuracy. Hence my quest to figure out how to measure the correlation between the models being ensembled; a sketch of one way to do this follows below.
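For concreteness, here is a minimal sketch of one way to score that correlation, assuming both models' top-1 predictions on the validation set have already been saved as NumPy arrays (the file names below are placeholders). It computes raw agreement, Cohen's kappa, normalized mutual information, and error overlap using scikit-learn:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, normalized_mutual_info_score

# Placeholder file names: top-1 predicted class indices of each model on the
# ImageNet validation set, plus the ground-truth labels (all shape [50000]).
preds_a = np.load("preds_model_a.npy")
preds_b = np.load("preds_model_b.npy")
labels = np.load("val_labels.npy")

# Raw agreement: fraction of images on which the two models predict the same class.
agreement = np.mean(preds_a == preds_b)

# Cohen's kappa: agreement corrected for chance.
kappa = cohen_kappa_score(preds_a, preds_b)

# Normalized mutual information between the two prediction streams;
# values near 1 mean the two models carry largely the same information.
nmi = normalized_mutual_info_score(preds_a, preds_b)

# Error overlap: among images Model A gets wrong, the fraction Model B also gets wrong.
wrong_a = preds_a != labels
error_overlap = np.mean(preds_b[wrong_a] != labels[wrong_a])

print(f"agreement={agreement:.3f}  kappa={kappa:.3f}  "
      f"NMI={nmi:.3f}  error_overlap={error_overlap:.3f}")
```

If all of these come out high, the two networks are making largely the same predictions and the same mistakes, which is exactly the regime where averaging their scores cannot add much.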

  • It did improve, from 35.6% to 35.68%? – Isbister
  • Yes, it did, but 0.08% seems like too small an improvement. – Ajinkya
  • So it is working, then; it's not causing any harm. There is no guarantee at all that there would be a larger improvement. – Isbister
  • Yes, I understand that, but I am more interested in why the accuracy only improved marginally. Had the accuracy improved significantly, I would know what's happening; the same goes for a reduction in accuracy. I am looking for an alternative to the confusion matrix, something that lets me see what's happening. – Ajinkya
  • How similar are the models? Do they have the same features and the same structure? If they have learnt different things, there could be a larger increase because they complement each other; otherwise it might just be information overlap. Have a look at their SHAP values (github.com/slundberg/shap)? – Isbister

From what you are saying, one can infer that you predict a score for each class and that the class with the highest score is your prediction.

A question you may ask is whether the class prediction is affected by the ensembling at all. Consider an example where Model A always predicts a class with a normalised score of 1.0 (i.e. full confidence). Suppose Model A is right 80% of the time. Model B, however, is less certain: it places its bets on the top 5 classes it deems likely, giving each an equal normalised score of 0.2. Averaging the scores then yields the same top predicted class as Model A alone, so the ensemble accuracy equals Model A's accuracy. You could easily check whether this is happening by calculating the correlation between each model's predictions and the ensembled predictions.
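As a toy check of this scenario (the numbers below are hypothetical, not taken from the question's models), the following sketch simulates a fully confident Model A and a diffuse Model B, and confirms that the averaged scores reproduce Model A's top-1 prediction on every sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 1000, 1000
labels = rng.integers(0, n_classes, size=n_samples)

# Model A: full confidence (score 1.0) on one class, right ~80% of the time.
a_top = np.where(rng.random(n_samples) < 0.8, labels,
                 rng.integers(0, n_classes, size=n_samples))
scores_a = np.zeros((n_samples, n_classes))
scores_a[np.arange(n_samples), a_top] = 1.0

# Model B: spreads a score of 0.2 over five classes, chosen at random here.
scores_b = np.zeros((n_samples, n_classes))
for i in range(n_samples):
    scores_b[i, rng.choice(n_classes, size=5, replace=False)] = 0.2

# After averaging, Model A's 1.0 dominates Model B's 0.2s, so the
# ensemble's argmax matches Model A's prediction on every sample.
ensemble_top = ((scores_a + scores_b) / 2).argmax(axis=1)
print(np.mean(ensemble_top == a_top))  # prints 1.0
```

Because Model A's score of 1.0 always dominates Model B's 0.2s after averaging, the ensemble's accuracy stays pinned to Model A's, much like the 35.6% → 35.68% result in the question.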
