Principal component analysis in R Commander

We recommend that you first read the following basic statistics entry:
Multivariate statistical analysis in R Commander

Principal component analysis (PCA) consists of generating new variables that are linear combinations of the original ones, capturing as much of the variation as possible while reducing the number of variables. The first component accounts for the largest share of the variation, the second somewhat less, and so on. In this way, instead of having many variables we have just a few that capture most of the observed variation.

In some cases, this smaller number of variables or principal components (usually two) can be used to perform multiple linear regressions.

Types of data

  • A set of quantitative variables measured on a sample of individuals.
  • Multivariate normal distribution

Example:

For our principal component analysis example we recorded 8 numeric variables for 14 low-birth-weight newborns (download data):

  • V1 = age of the mother (years)
  • V2 = number of cigarettes smoked per day by the mother
  • V3 = height of the mother (inches)
  • V4 = weight of the mother (pounds)
  • V5 = age of the father (years)
  • V6 = level of education of the father
  • V7 = number of cigarettes smoked per day by the father
  • V8 = height of the father (inches)

Can we reduce the number of variables while retaining as much of the variation as possible?

Step 1. Load the data into R Commander

Once the sample data have been downloaded, we load them into R Commander using the following route:

Load data
Data – Load data set…

Then we display them with View data set:

Data components

Step 2. Run the principal component analysis (PCA)

To run the principal component analysis on the sample data, we follow this route in R Commander:

Principal component analysis in R Commander
Statistics – Dimensional analysis – Principal-components analysis…

The following window now appears, in which we have to select the 8 variables of our example by clicking and dragging from V1 to V8:

Principal component analysis

Before accepting, we go to the Options tab and decide whether or not to standardize the data.

2.1. Standardization: what is it and how is it used in R Commander?

In principal component analysis it is important to decide, depending on the nature of the data, whether to standardize them or not:

  1. Standardize: when the variables in the study have different scales or units of measurement. The analysis is computed from the correlation matrix (each standardized variable has variance = 1).
  2. Do not standardize: when the variables in the study share the same scale or units of measurement. The analysis is computed from the covariance matrix.
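
As a rough illustration of why this choice matters, the following R sketch runs princomp() both ways. It uses R's built-in USArrests data set as a stand-in (an assumption made purely for illustration; it is not the newborn data of this example), because its variables are also measured on very different scales.

```r
# Stand-in data (assumption): USArrests ships with R; it is NOT the newborn data.
X <- USArrests

# The variances differ by orders of magnitude.
round(apply(X, 2, var), 1)

# Not standardized: analysis of the covariance matrix.
pc_cov <- princomp(X, cor = FALSE)
# Standardized: analysis of the correlation matrix.
pc_cor <- princomp(X, cor = TRUE)

# Coefficients of the first component under each option.
round(unclass(loadings(pc_cov))[, 1], 3)
round(unclass(loadings(pc_cor))[, 1], 3)

# Variable with the largest absolute coefficient in component 1 (covariance case).
dominant <- names(which.max(abs(unclass(loadings(pc_cov))[, 1])))
dominant
```

Without standardization the first component is dominated by the variable with the largest variance, whereas with cor = TRUE all variables enter on an equal footing.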

Principal components analysis - standardize - R Commander

As the sample data have different scales and units (years, inches, pounds, etc.), we have to standardize by selecting "Analyze correlation matrix".

2.2. Results and interpretation

The results are as shown below in R Commander:

Screenshot-R Commander-1

But let's see what all this output means:

> local({
+   .PC <- princomp(~V1+V2+V3+V4+V5+V6+V7+V8, cor=TRUE, data=Datos)
+   cat("\nComponent loadings:\n")
+   print(unclass(loadings(.PC)))
+   cat("\nComponent variances:\n")
+   print(.PC$sd^2)
+   cat("\n")
+   print(summary(.PC))
+ })

Component loadings:
         Comp.1      Comp.2       Comp.3       Comp.4        Comp.5       Comp.6       Comp.7        Comp.8
V1   0.45927632  -0.3300700   0.42221491  -0.06912846   0.014058950   0.09484631  -0.08936572   0.692744662
V2  -0.42975751  -0.2662236   0.08772638  -0.32688585   0.445127021  -0.45333765   0.43955233   0.181724713
V3  -0.06900791  -0.6387247  -0.16726543   0.11575498  -0.186804457   0.52531263   0.45973313  -0.153908871
V4   0.03880912  -0.5597814  -0.41999034   0.27503283   0.007091408  -0.44997788  -0.48008663  -0.009493601
V5   0.48981256  -0.1997337   0.44528178  -0.07725784   0.091430449  -0.27164575   0.10749267  -0.649799865
V6   0.38390014   0.1163544  -0.37150886   0.14717175   0.782395703   0.24418386   0.08728867   0.003984449
V7  -0.43872101  -0.1867023   0.42200008   0.02548362   0.378383648   0.38241113  -0.52162403  -0.180079334
V8  -0.13540205   0.1027849   0.30869124   0.87809277   0.055148344  -0.16553023   0.25632827   0.092836342

Interpretation of results (Component loadings):
The loadings (also known as factor loadings or eigenvectors) are the coefficients of the equation of each principal component.
For example, principal component 1 (Comp.1) has the following equation:

PC1 = 0.45927632 * Z1 - 0.42975751 * Z2 - 0.06900791 * Z3 + 0.03880912 * Z4 + 0.48981256 * Z5 + 0.38390014 * Z6 - 0.43872101 * Z7 - 0.13540205 * Z8

Note that in the equation the original variables (V1-V8) have been replaced by Z (Z1-Z8), since these are the standardized variables.
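
To make the equation concrete, the sketch below rebuilds the scores of the first component by hand and compares them with those returned by princomp(). It uses R's built-in USArrests data as a stand-in (an assumption; with the newborn data the same steps would apply to V1-V8). Note that princomp() standardizes with the divisor n rather than n - 1.

```r
# Stand-in data (assumption): USArrests, not the newborn data set.
X <- as.matrix(USArrests)
pc <- princomp(X, cor = TRUE)

# Standardize each variable: subtract the mean and divide by the
# standard deviation computed with divisor n, as princomp() does.
sd_n <- apply(X, 2, function(x) sqrt(mean((x - mean(x))^2)))
Z <- scale(X, center = colMeans(X), scale = sd_n)

# Coefficients (loadings) of the first principal component.
coef1 <- unclass(loadings(pc))[, 1]

# Score of each observation = sum of coefficient * standardized value.
pc1_manual <- as.vector(Z %*% coef1)

# Compare with the scores computed by princomp().
same_scores <- isTRUE(all.equal(pc1_manual, unname(pc$scores[, 1])))
same_scores
```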

Component variances:
    Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7     Comp.8
2.68616770 1.86196171 1.11240271 1.03044187 0.61920451 0.36727816 0.27738461 0.04515874

Interpretation of results (Component variances):
These are the eigenvalues. The value for each component is the square of its standard deviation (see the following interpretation). They add up to 8, since there are 8 principal components and the data are standardized.
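
The link between these variances and the correlation matrix can be checked with a few lines of R. This is only a sketch on the built-in USArrests data (a stand-in chosen as an assumption, with 4 variables instead of 8): the component variances should match the eigenvalues of the correlation matrix and add up to the number of variables.

```r
# Stand-in data (assumption): USArrests, not the newborn data set.
X <- USArrests
pc <- princomp(X, cor = TRUE)

# Eigenvalues of the correlation matrix.
ev <- eigen(cor(X))$values

# Component variances as reported by R Commander (.PC$sd^2).
component_variances <- unname(pc$sdev^2)

rbind(eigenvalues = ev, variances = component_variances)
sum(component_variances)   # equals the number of variables
```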

Importance of components:
                         Comp.1    Comp.2    Comp.3    Comp.4     Comp.5     Comp.6     Comp.7      Comp.8
Standard deviation     1.638953 1.3645372 1.0547050 1.0151068 0.78689549 0.60603478 0.52667315 0.212505858
Proportion of Variance 0.335771 0.2327452 0.1390503 0.1288052 0.07740056 0.04590977 0.03467308 0.005644842
Cumulative Proportion  0.335771 0.5685162 0.7075665 0.8363717 0.91377231 0.95968208 0.99435516 1.000000000

Interpretation of results (Importance of components):

- Standard deviation: the standard deviation of each principal component. It is calculated from the scores obtained by substituting the values of every newborn into the equation of each principal component.
- Proportion of Variance: the proportion of the total variance explained by each principal component. These proportions add up to 1. This row is the most important one for our results.
- Cumulative Proportion: the running total of the proportions of variance, added component by component.
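
The three rows are related by simple arithmetic, which the following sketch reproduces from the standard deviations printed above. The variable names (sdev, prop_var, cum_prop) are chosen here for illustration only.

```r
# Standard deviations copied from the "Importance of components" output.
sdev <- c(1.638953, 1.3645372, 1.0547050, 1.0151068,
          0.78689549, 0.60603478, 0.52667315, 0.212505858)
names(sdev) <- paste0("Comp.", 1:8)

# Variance of each component = squared standard deviation.
variance <- sdev^2

# Proportion of the total variance explained by each component.
prop_var <- variance / sum(variance)

# Running total of those proportions.
cum_prop <- cumsum(prop_var)

round(rbind(sdev, prop_var, cum_prop), 4)
```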

We note that the first two components account for 56.9% of the variance or, equivalently, that 43.1% of the variation remains unexplained. So how many components should we keep?

2.3. How many components should we keep?

An informal method for deciding how many components to keep is to choose the principal components that together…

  • … account for more than 70% of the total variation, and…
  • … if they come from standardized data, have associated eigenvalues greater than 1.

The eigenvalues of each principal component can be displayed graphically. This chart is called a scree plot. In R Commander:

Eigenvalues and scree plot
Following the route described above for the PCA, we select the Options tab and check Screeplot.

The resulting chart, showing the eigenvalue of each principal component, is as follows:

Scree plot
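
A similar chart can be drawn directly from the R Script tab. The sketch below is one possible way to do it, using base graphics and the component variances printed earlier rather than the exact command that R Commander generates; the dashed reference line at 1 is an addition made here for illustration.

```r
# Eigenvalues (component variances) copied from the output above.
ev <- c(2.68616770, 1.86196171, 1.11240271, 1.03044187,
        0.61920451, 0.36727816, 0.27738461, 0.04515874)

# Scree plot: eigenvalue against component number.
plot(seq_along(ev), ev, type = "b",
     xlab = "Principal component", ylab = "Eigenvalue",
     main = "Scree plot")

# Reference line at eigenvalue = 1.
abline(h = 1, lty = 2)
```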

Copying part of the previous results again, note that the first 3 components account for 70.8% of the variation…

Importance of components:
                         Comp.1    Comp.2    Comp.3    Comp.4     Comp.5     Comp.6     Comp.7      Comp.8
Standard deviation     1.638953 1.3645372 1.0547050 1.0151068 0.78689549 0.60603478 0.52667315 0.212505858
Proportion of Variance 0.335771 0.2327452 0.1390503 0.1288052 0.07740056 0.04590977 0.03467308 0.005644842
Cumulative Proportion  0.335771 0.5685162 0.7075665 0.8363717 0.91377231 0.95968208 0.99435516 1.000000000

… and that the components up to the fourth have eigenvalues greater than 1:

Component variances:
    Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7     Comp.8
2.68616770 1.86196171 1.11240271 1.03044187 0.61920451 0.36727816 0.27738461 0.04515874

We therefore keep the first 4 principal components.
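
The two informal criteria can also be applied programmatically. In the sketch below, taking the larger of the two counts so that both criteria are satisfied is an interpretation of how the rules are combined in this example, not a fixed convention, and the variable names are illustrative.

```r
# Eigenvalues (component variances) copied from the output above.
ev <- c(2.68616770, 1.86196171, 1.11240271, 1.03044187,
        0.61920451, 0.36727816, 0.27738461, 0.04515874)

# Criterion 1: smallest number of components whose cumulative
# proportion of variance exceeds 70%.
cum_prop <- cumsum(ev) / sum(ev)
n_70 <- which(cum_prop > 0.70)[1]

# Criterion 2 (standardized data): components with eigenvalue > 1.
n_gt1 <- sum(ev > 1)

# Keep enough components to satisfy both criteria (assumption).
n_keep <- max(n_70, n_gt1)

c(n_70 = n_70, n_gt1 = n_gt1, n_keep = n_keep)
```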

2.4. How do you interpret all this in our example?

We have finally decided to keep the first 4 principal components, because they are the ones that meet the requirements set out in the previous section. But what does this mean? We copy the coefficients of the components' equations and look at which have the highest absolute values in each component:

Component loadings:
         Comp.1      Comp.2       Comp.3       Comp.4
V1   0.45927632  -0.3300700   0.42221491  -0.06912846
V2  -0.42975751  -0.2662236   0.08772638  -0.32688585
V3  -0.06900791  -0.6387247  -0.16726543   0.11575498
V4   0.03880912  -0.5597814  -0.41999034   0.27503283
V5   0.48981256  -0.1997337   0.44528178  -0.07725784
V6   0.38390014   0.1163544  -0.37150886   0.14717175
V7  -0.43872101  -0.1867023   0.42200008   0.02548362
V8  -0.13540205   0.1027849   0.30869124   0.87809277

  • Component 1 accounts for 33.6% of the variation, and its coefficients with the highest absolute values are those of V1, V2, V5 and V7. This means that these are the variables that contribute most to this variation. In our example, these variables correspond to the parents' ages and the number of cigarettes they smoke per day. They therefore seem to be related to children being born with low weight.
  • Component 2 accounts for 23.3% of the variation, and its coefficients with the highest absolute values are those of V3 and V4, which correspond to the mother's build (height and weight).
  • The other two components are somewhat harder to interpret, but interpreting them is not necessary to carry out a proper principal component analysis.

2.4.1. Graphical representation of the first two components

Starting from the code that was run at the beginning of the analysis (copied below), copy the line that creates .PC (highlighted in bold and underlined in the original output) and remove the leading dot from .PC…

> local({
+   .PC <- princomp(~V1+V2+V3+V4+V5+V6+V7+V8, cor=TRUE, data=Datos)
+   cat("\nComponent loadings:\n")
+   print(unclass(loadings(.PC)))
+   cat("\nComponent variances:\n")
+   print(.PC$sd^2)
+   cat("\n")
+   print(summary(.PC))
+ })

… and then write biplot(PC) underneath it:

PC <- princomp(~V1+V2+V3+V4+V5+V6+V7+V8, cor=TRUE, data=Datos)
biplot(PC)

Copy this and paste it into R Commander (on the R Script tab). Select both lines and click Run. The result is as follows:

Graph of principal components

We see that the horizontal axis represents principal component 1 and the vertical axis principal component 2.

The further a red arrow extends along an axis, the larger the coefficient of that variable in the corresponding component. The plot is thus a faithful graphical representation of what we observed in section 2.4.

For example, we see that for principal component 1 the longest arrows are those of V5 and V1 (positive values) and V7 and V2 (negative values), which correspond to the largest coefficients obtained in section 2.4 for that component. Principal component 2, in turn, has two variables (V3 and V4) with longer arrows, which again match the two variables picked out in section 2.4.

2.5. Add the main components to the data set

We have finally decided to keep the first 4 principal components. How do we add them to the data set so that we can continue the statistical analysis with them?

Add components to the data set
Following the route described above for the PCA, we select the Options tab and check the option to add the principal components to the data set.

A window then appears in which we indicate that we want 4 components:

4 principal components

When we view the data set, we should see 4 components added after the initial 8 variables:

Principal components added to the initial data set
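
The same result can be approximated from the R Script tab. The sketch below is not the code R Commander generates; it uses the built-in USArrests data as a stand-in (an assumption), keeps 2 components because that data set has only 4 variables, and the column names PC1 and PC2 are chosen here for illustration.

```r
# Stand-in data (assumption): USArrests, not the newborn data set.
Datos_demo <- USArrests
pc <- princomp(Datos_demo, cor = TRUE)

# Number of components to keep (the example in the text keeps 4 of 8).
k <- 2

# Scores of the first k components, one row per observation.
scores <- as.data.frame(pc$scores[, 1:k])
names(scores) <- paste0("PC", 1:k)

# Append the component scores after the original variables.
Datos_demo <- cbind(Datos_demo, scores)
head(Datos_demo)
```

The added columns can then be used like any other variable, for instance as predictors in a regression.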

2.6. Reducing the number of variables, the objective of principal component analysis

If, instead of generating new variables (as in the previous case), we want to discard the least informative of the original variables in our study, some authors have proposed methods for doing so.

Jolliffe (1972, 1973) proposes reducing the original variables directly to those that meet the following requirements:

  1. Choose the principal components whose eigenvalues are greater than 0.7.
  2. For each of the selected components, select the variable whose coefficient has the largest absolute value (and that has not been selected previously).

In our example, we have 4 principal components whose eigenvalues are above 0.7 (see above). The eigenvectors of these components are shown below:

Component loadings:
         Comp.1      Comp.2       Comp.3       Comp.4
V1   0.45927632  -0.3300700   0.42221491  -0.06912846
V2  -0.42975751  -0.2662236   0.08772638  -0.32688585
V3  -0.06900791  -0.6387247  -0.16726543   0.11575498
V4   0.03880912  -0.5597814  -0.41999034   0.27503283
V5   0.48981256  -0.1997337   0.44528178  -0.07725784
V6   0.38390014   0.1163544  -0.37150886   0.14717175
V7  -0.43872101  -0.1867023   0.42200008   0.02548362
V8  -0.13540205   0.1027849   0.30869124   0.87809277

  • In Comp.1 the coefficient with the highest absolute value (0.4898) corresponds to V5.
  • In Comp.2 it corresponds to V3.
  • In Comp.3 it corresponds to V5, but as V5 has already been selected for Comp.1, we select the variable with the next highest value, V1.
  • In Comp.4 it corresponds to V8.

We are therefore left with only variables V1, V3, V5 and V8.
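
The selection steps described above can be sketched in R as follows. This is a minimal illustration of the procedure as stated in this section, applied to the loadings of the 4 retained components; it is not presented as a full implementation of the published methods, and the object names are chosen here for illustration.

```r
# Loadings of the 4 retained components, copied from the output above.
loadings4 <- matrix(c(
   0.45927632, -0.3300700,  0.42221491, -0.06912846,
  -0.42975751, -0.2662236,  0.08772638, -0.32688585,
  -0.06900791, -0.6387247, -0.16726543,  0.11575498,
   0.03880912, -0.5597814, -0.41999034,  0.27503283,
   0.48981256, -0.1997337,  0.44528178, -0.07725784,
   0.38390014,  0.1163544, -0.37150886,  0.14717175,
  -0.43872101, -0.1867023,  0.42200008,  0.02548362,
  -0.13540205,  0.1027849,  0.30869124,  0.87809277),
  nrow = 8, byrow = TRUE,
  dimnames = list(paste0("V", 1:8), paste0("Comp.", 1:4)))

selected <- character(0)
for (j in seq_len(ncol(loadings4))) {
  # Absolute coefficients of component j, skipping variables already chosen.
  a <- abs(loadings4[, j])
  a <- a[!(names(a) %in% selected)]
  # Keep the variable with the largest remaining absolute coefficient.
  selected <- c(selected, names(which.max(a)))
}

sort(selected)
```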

Long live free software!

