Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization. It's 100% free, no registration required.

My question is very similar to this one, which was not solved unfortunately.

I am working on a project in which I want to rank countries by their HIV/AIDS burden, so I collected a lot of data for all countries in the world. For simplicity, let's assume that I have the following variables for each country:

  • DEA: Deaths due to HIV
  • LIV: People living with HIV
  • PRV: HIV prevalence rate
  • DALY: number of healthy years lost due to HIV
  • DALY ratio: proportion of healthy years lost due to HIV in total number of healthy years lost due to disease in general.

So all these variables somehow measure the same thing: the HIV burden. Now I want to combine all these variables into one 'score', so that I can rank countries by their HIV burden.

The first thing that came to my mind was to perform a principal component analysis (PCA) and retain one PC. However, the loadings of this first PC are as follows:

  • DEA: 0.366
  • LIV: -0.392
  • PRV: -0.442
  • DALY: 0.466
  • DALY ratio: 0.481

Because of the high pairwise correlations between the variables, I would have expected all of the loadings to have the same sign. As it stands, a country with a high HIV burden (scoring high on each of the variables) gets a lower score on the first PC from the negative loadings of 'LIV' and 'PRV', and at the same time a higher score from the positive loadings of 'DEA', 'DALY' and 'DALY ratio'.
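To make the concern concrete, here is a small sketch of how a first-PC score is built from the loadings listed above. The standardized values in `z` are hypothetical (a made-up country sitting one standard deviation above the mean on every variable); only the loadings come from the question.

```python
import numpy as np

# Loadings of the first PC, copied from the list above.
names = ["DEA", "LIV", "PRV", "DALY", "DALY ratio"]
loadings = np.array([0.366, -0.392, -0.442, 0.466, 0.481])

# Hypothetical country: one standard deviation above the mean on every variable.
z = np.ones(5)

# Each variable contributes loading * standardized value to the score.
contributions = loadings * z
score = contributions.sum()

for name, c in zip(names, contributions):
    print(f"{name:>10}: {c:+.3f}")
print(f" PC1 score: {score:+.3f}")
```

For this made-up country the positive contributions (1.313 in total) are partly cancelled by the negative ones (0.834 in total), leaving a score of 0.479, which is the cancellation described above.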

My questions:

  • Is it correct that looking at the scores for the first PC is not a proper way to give a score for HIV burden to each of the countries because of the contrary loadings as explained above?

  • Can you suggest another (better) way to combine all the information into one single score?

"So all these variables somehow measure the same thing: the HIV burden. Now I want to combine all these variables into one 'score'." Your variables sound to me too mixed: heterogeneous in their measurement units, in what their quantities mean, and probably in their distributional qualities. If so, you should avoid combining them directly into one score. Rather, it's better to rank countries by each variable separately and then derive an overall rank (such as just the mean rank, or something more sophisticated). – ttnphns 1 hour ago
And in general, the pursuit of a single index out of a multidimensional reality is often unwarranted. The fake simplicity may corrupt the minds of the audience and, worse, of the decision-makers within it. What might be recommended in your place is to group the variables into a few theoretically, conceptually kindred blocks (e.g., prevalence; survival; economic burden; etc.) and analyze by blocks; different analyses might be called for in different blocks. It's not a simple task, it's a creative one. – ttnphns 1 hour ago
    
Sound advice in the answer & comments here. Note also that the DALY purports to measure HIV burden by itself, already incorporating information on prevalence & mortality. – Scortchi 40 mins ago
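
The rank-then-average idea from the first comment above can be sketched as follows. This is only an illustration: the four countries and all their values are made up, and the helper `rank_descending` is a hypothetical name that assumes no ties within a column.

```python
import numpy as np

# Made-up values for four hypothetical countries; columns: DEA, LIV, PRV.
countries = ["A", "B", "C", "D"]
data = np.array([
    [12000.0, 300000.0, 0.8],
    [  500.0,  20000.0, 4.5],
    [ 9000.0, 150000.0, 2.1],
    [  100.0,   5000.0, 0.1],
])

def rank_descending(col):
    """Rank 1 = largest value. Assumes no ties within the column."""
    order = np.argsort(-col)
    ranks = np.empty(len(col), dtype=int)
    ranks[order] = np.arange(1, len(col) + 1)
    return ranks

# Rank the countries on each variable separately, then average the ranks.
ranks = np.column_stack(
    [rank_descending(data[:, j]) for j in range(data.shape[1])]
)
mean_rank = ranks.mean(axis=1)

for name, r, m in zip(countries, ranks, mean_rank):
    print(name, r, round(float(m), 2))
```

Working on ranks sidesteps the different units and the skewness of the raw variables, at the price of discarding how far apart the countries actually are. Country A here ranks first on both counts but only third on prevalence, which is the kind of disagreement a single score hides.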

Taking your example literally, I'd say the approach is problematic from the outset.

  • If the problem is assessing the total burden, then absolute numbers of deaths and people living with AIDS are key variables, but any PCA is likely to be dominated by a small number of countries with large populations. Even if you use correlation-based PCA, as you should when variables are in very different units, you will have some large outliers in there for most conceivable mixes of countries.

  • If the problem is assessing the total burden given population sizes, then the other variables are relevant.

  • It seems unlikely that mixing together different kinds of variables will help either purpose.

  • The biggest question of all is whether it's a good idea at all to seek a single scale in this way. The best that I can do is flag that statistically-minded people have very different views on this, many highly negative. My own view is that PCA of this sort will only be of interest to those capable of understanding and criticising the PCA and doing their own alternative analysis. A fallacy known under many different names, of which one is the fallacy of misplaced concreteness, is confusing a desire for a single measure with a demonstration that such a measure can be reliably and intelligibly identified from data. It's one thing to have a single name (creativity, intelligence, in this case burden) and another thing to have a single quantifiable dimension.
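
As a rough illustration of the point about units in the first bullet, the sketch below compares covariance-based and correlation-based PCA on synthetic data. Nothing here is the asker's data: `latent`, `counts` and `rate` are invented, with `counts` standing in for a skewed absolute number and `rate` for a prevalence-like variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: one shared latent factor drives both variables.
latent = rng.normal(size=n)
counts = np.exp(10 + latent + 0.3 * rng.normal(size=n))  # skewed, very large scale
rate = 2 + 0.5 * latent + 0.2 * rng.normal(size=n)       # small scale
X = np.column_stack([counts, rate])

def first_pc(matrix):
    """Eigenvector belonging to the largest eigenvalue of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)  # eigenvalues come back ascending
    return eigvecs[:, -1]

pc_cov = first_pc(np.cov(X, rowvar=False))        # PCA on raw units
pc_corr = first_pc(np.corrcoef(X, rowvar=False))  # PCA on standardized variables

print("covariance PCA loadings: ", np.round(pc_cov, 4))
print("correlation PCA loadings:", np.round(pc_corr, 4))
```

On raw units the first PC is essentially the `counts` variable alone, simply because its variance is so much larger. Standardizing removes that scale effect, but it does nothing about the skewness or the outliers in `counts`, which is the remaining concern raised in that bullet.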

Turning to your results, what's most alarming, as you clearly flag, is that the loadings on the first PC don't even have the same sign. If there is one important shared dimension that justifies trying to quantify burden as a single measure, then it minimally requires all those variables to be positively correlated with each other (or for reversals of sign to be obvious consequences of some measures being direct and some inverse, which doesn't seem the case here). Without seeing the data, I can't interpret further, but I'd expect the variation in sign to be a side-effect of mushing together quite different variables that are also skewed in distribution and with outliers.
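
One check that follows from this: if every pairwise correlation is strictly positive, the Perron-Frobenius theorem guarantees that the leading eigenvector of the correlation matrix has entries of a single sign. Mixed signs on the first PC therefore point to at least one negative correlation somewhere. A minimal sketch, using two hypothetical 3×3 correlation matrices rather than the asker's data (`first_pc_loadings` and `negative_pairs` are illustrative helper names):

```python
import numpy as np

def first_pc_loadings(corr):
    """Leading eigenvector of a correlation matrix, oriented so its entries sum >= 0."""
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues come back ascending
    v = eigvecs[:, -1]
    return v if v.sum() >= 0 else -v

def negative_pairs(corr):
    """Index pairs (i, j), i < j, whose correlation is negative."""
    return [tuple(int(k) for k in p) for p in np.argwhere(np.triu(corr, k=1) < 0)]

# Hypothetical case 1: all pairwise correlations positive.
all_positive = np.array([
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.7],
    [0.6, 0.7, 1.0],
])

# Hypothetical case 2: same strengths, but the third variable runs the other way.
one_reversed = np.array([
    [ 1.0,  0.8, -0.6],
    [ 0.8,  1.0, -0.7],
    [-0.6, -0.7,  1.0],
])

for label, corr in [("all positive", all_positive), ("one reversed", one_reversed)]:
    print(label, np.round(first_pc_loadings(corr), 3), negative_pairs(corr))
```

Running the same two helpers on the real correlation matrix would show directly which pairs of variables are negatively correlated, and hence where the sign flips on 'LIV' and 'PRV' come from.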

Plotting the data will help you understand why you got the results you did.

I don't have suggestions for a different way to collapse to a single score. I've seen too many applications in which such endeavours were not helpful to be positive there.
