I have two models, A and B, trained on ImageNet. Their accuracies on the ImageNet validation set are 35.6% and 28.64% respectively, while the accuracy of their ensemble (obtained by averaging their scores) is 35.68%. I am interested in finding out why ensembling isn't effective here.
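For concreteness, this is roughly how I compute the ensemble accuracy (a minimal sketch; the `.npy` paths are placeholders for each model's softmax scores and the ground-truth labels on the validation set):

```python
import numpy as np

# Placeholder files: per-model softmax scores, shape (n_samples, n_classes),
# and integer ground-truth labels, shape (n_samples,).
scores_a = np.load("scores_a.npy")
scores_b = np.load("scores_b.npy")
labels = np.load("labels.npy")

def accuracy(scores, labels):
    """Top-1 accuracy from class scores."""
    return float((scores.argmax(axis=1) == labels).mean())

print("model A :", accuracy(scores_a, labels))
print("model B :", accuracy(scores_b, labels))
print("ensemble:", accuracy((scores_a + scores_b) / 2.0, labels))
```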
Specifically, I was going to inspect the confusion matrices for each model, but ImageNet has 1000 classes, which makes inspecting a 1000×1000 confusion matrix intractable. Mutual information was also suggested to me, but I can't figure out how to apply it in this context.
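My best guess so far, sketched below with scikit-learn (using the prediction arrays derived from the scores above), is to treat each model's hard predictions as a discrete random variable and compute the mutual information between the two, but I'm not sure this is the intended formulation:

```python
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

# Hard (top-1) predictions of each model, from the score arrays above.
pred_a = scores_a.argmax(axis=1)
pred_b = scores_b.argmax(axis=1)

# MI between the two prediction sequences, treated as discrete variables;
# the normalized variant lies in [0, 1] and is easier to interpret.
print("MI :", mutual_info_score(pred_a, pred_b))
print("NMI:", normalized_mutual_info_score(pred_a, pred_b))
```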
So I have a two-part question:
- Why doesn't the accuracy of the ensemble degrade (toward the average of the two accuracies) or improve?
- Is there a way of visualizing or scoring the outputs of the two networks to measure their correlation? (See the sketch right after this list.)
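To make the second question concrete, the kind of scoring I have in mind (a sketch, reusing the `pred_a`, `pred_b`, and `labels` arrays from above) breaks the validation set down by which models are correct and adds a chance-corrected agreement score:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

correct_a = pred_a == labels
correct_b = pred_b == labels

both    = float(np.mean(correct_a & correct_b))
only_a  = float(np.mean(correct_a & ~correct_b))
only_b  = float(np.mean(~correct_a & correct_b))
neither = float(np.mean(~correct_a & ~correct_b))
print(f"both={both:.3f} only_A={only_a:.3f} only_B={only_b:.3f} neither={neither:.3f}")

# "Oracle" upper bound: accuracy if the right model could be picked per sample.
print("oracle:", both + only_a + only_b)

# Chance-corrected agreement between the two models' raw predictions.
print("kappa :", cohen_kappa_score(pred_a, pred_b))
```

My reading would be that if `only_a` is large and `only_b` is small, averaging mostly dilutes A's correct scores with B's errors, which would be consistent with the numbers above.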
Edit 1: Both are AlexNet models, but they were trained from two different pre-trained weight initializations. The pre-trained weights themselves come from two different self-supervised tasks. Also, when these models (initialized with their respective pre-trained weights) were trained on PASCAL VOC, there is a significant boost in accuracy. Hence my quest to figure out how to measure the correlation between the models being ensembled.
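One candidate I've come across from the ensemble-diversity literature (Kuncheva & Whitaker's Q-statistic and the pairwise correlation coefficient; sketch below, reusing the correctness indicators from the previous snippet) measures exactly this kind of between-model correlation:

```python
import numpy as np

# Joint correctness fractions over the validation set; Q and rho are
# scale-invariant, so fractions work as well as raw counts.
n11 = float(np.mean(correct_a & correct_b))    # both correct
n00 = float(np.mean(~correct_a & ~correct_b))  # both wrong
n10 = float(np.mean(correct_a & ~correct_b))   # only A correct
n01 = float(np.mean(~correct_a & correct_b))   # only B correct

# Yule's Q-statistic: +1 when the models err on the same examples;
# values near 0 (or negative) indicate the diversity ensembles need.
q = (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

# Pearson correlation between the two correctness indicators.
rho = (n11 * n00 - n01 * n10) / np.sqrt(
    (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
)
print(f"Q = {q:.3f}, rho = {rho:.3f}")
```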