I want to read…

- Step 1. We load the data into R Commander
- Step 2. Run the principal component analysis (PCA)
- 2.1. Standardization: what is it and how do we use it in R Commander?
- 2.2. Results and interpretation
- 2.3. How many components do we keep?
- 2.4. How do we interpret all this in our example?
- 2.5. Add the principal components to the data set
- 2.6. Reducing the number of variables: the goal of principal component analysis

### We recommend you read the following basic statistics post:

**Multivariate statistical analysis in R Commander**

**Principal component analysis** (PCA) consists of generating new variables that are linear combinations of the original ones, capturing as much of the variation as possible while reducing their number. The first component captures the largest share of the variation, the second somewhat less, and so on. In this way, instead of many variables we keep just a few that group most of the observed variation.

In some cases, this smaller number of variables or principal components (usually two) can be used to perform **multiple linear regressions**.

## Types of data

- A set of **quantitative variables** measured on a sample of individuals.
- **Multivariate normal distribution**

Example: for our principal component analysis we take 8 numeric variables measured on 14 low-birth-weight newborns (download data):

- V1 = age of the mother (years)
- V2 = number of cigarettes smoked per day by the mother
- V3 = height of the mother (inches)
- V4 = weight of the mother (pounds)
- V5 = age of the father (years)
- V6 = level of education of the father
- V7 = number of cigarettes smoked per day by the father
- V8 = height of the father (inches)

Can we reduce the number of variables while keeping as much of the variation as possible?

# Step 1. We load the data into R Commander

Once the sample data are downloaded, we load them into **R Commander** using the following menu path:

And view them with **View data set**:
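Behind the menus, loading the data boils down to a single `read.table()` call. The sketch below is illustrative only: the file name is an assumption (adjust it to wherever you saved the download), and the self-contained part uses a random synthetic stand-in, NOT the real newborn data.

```r
# The menu route corresponds roughly to (file name is an assumption):
# Datos <- read.table("datos_recien_nacidos.txt", header = TRUE)

# Self-contained illustration: a synthetic stand-in with the same
# structure (14 rows, numeric variables V1..V8); values are random,
# not the real sample data.
set.seed(1)
Datos <- as.data.frame(matrix(rnorm(14 * 8), nrow = 14,
                              dimnames = list(NULL, paste0("V", 1:8))))
str(Datos)  # 14 obs. of 8 numeric variables
```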

# Step 2. Run the principal component analysis (PCA)

To run the principal component analysis on the sample data, we follow this menu path in **R Commander**:

The following window appears, in which we select the 8 variables of our example by clicking and dragging from V1 to V8:

Before accepting, we go to the **Options** tab and decide whether we want to **standardize** the data or not.

## 2.1. Standardization: what is it and how do we use it in R Commander?

In principal component analysis it is important, depending on the nature of our data, to decide whether to standardize or not:

- **Standardize**: when the variables in the study have different scales or units of measure. The analysis is computed from the **correlation matrix** (each variable has variance = 1).
- **Do not standardize**: when the variables in the study share the same scale or unit of measure. The analysis is computed from the **covariance matrix**.

Since our sample data have different scales and units (years, heights, weights, etc.), we standardize by selecting **"Analyze the correlation matrix"**.
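This choice corresponds to the `cor` argument of R's `princomp()` function, which R Commander calls under the hood. A minimal sketch with synthetic data (the variables and values are illustrative, not the newborn data) shows why it matters:

```r
# V2 is rescaled so its variance dwarfs the others.
set.seed(42)
X <- as.data.frame(matrix(rnorm(30 * 3), nrow = 30,
                          dimnames = list(NULL, c("V1", "V2", "V3"))))
X$V2 <- X$V2 * 100  # much larger scale than V1 and V3

pc_cor <- princomp(X, cor = TRUE)   # standardized: correlation matrix
pc_cov <- princomp(X, cor = FALSE)  # raw scales: covariance matrix

# Without standardizing, the large-scale variable dominates component 1:
round(loadings(pc_cov)[, 1], 2)
round(loadings(pc_cor)[, 1], 2)
```

With `cor = FALSE`, component 1 is essentially just V2; with `cor = TRUE`, all three variables get a fair say.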

## 2.2. Results and interpretation

**R Commander** shows the following results:

But let's see what all that output means:

```
> local({
+   .PC <- princomp(~V1+V2+V3+V4+V5+V6+V7+V8, cor=TRUE, data=Datos)
+   cat("\nComponent loadings:\n")
+   print(unclass(loadings(.PC)))
+   cat("\nComponent variances:\n")
+   print(.PC$sd^2)
+   cat("\n")
+   print(summary(.PC))
+ })

Component loadings:
        Comp.1     Comp.2      Comp.3      Comp.4       Comp.5      Comp.6      Comp.7       Comp.8
V1  0.45927632 -0.3300700  0.42221491 -0.06912846  0.014058950  0.09484631 -0.08936572  0.692744662
V2 -0.42975751 -0.2662236  0.08772638 -0.32688585  0.445127021 -0.45333765  0.43955233  0.181724713
V3 -0.06900791 -0.6387247 -0.16726543  0.11575498 -0.186804457  0.52531263  0.45973313 -0.153908871
V4  0.03880912 -0.5597814 -0.41999034  0.27503283  0.007091408 -0.44997788 -0.48008663 -0.009493601
V5  0.48981256 -0.1997337  0.44528178 -0.07725784  0.091430449 -0.27164575  0.10749267 -0.649799865
V6  0.38390014  0.1163544 -0.37150886  0.14717175  0.782395703  0.24418386  0.08728867  0.003984449
V7 -0.43872101 -0.1867023  0.42200008  0.02548362  0.378383648  0.38241113 -0.52162403 -0.180079334
V8 -0.13540205  0.1027849  0.30869124  0.87809277  0.055148344 -0.16553023  0.25632827  0.092836342
```

**Interpretation of results (Component loadings):**

Also known as **loading factors** or **eigenvectors**, these are the coefficients of the equation of each principal component.

For example, principal component 1 (Comp.1) has the following equation:

CP1 = 0.45927632 * Z1 – 0.42975751 * Z2 – 0.06900791 * Z3 + 0.03880912 * Z4 + 0.48981256 * Z5 + 0.38390014 * Z6 – 0.43872101 * Z7 – 0.13540205 * Z8

*Note that in the equation the original variables (V1-V8) have been replaced by Z (Z1-Z8), since they are the standardized variables.*
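This relationship can be checked numerically: multiplying the standardized data by the loadings reproduces the component scores. A minimal sketch with synthetic data (the variables and values are illustrative, not the newborn data):

```r
set.seed(7)
X <- matrix(rnorm(20 * 4), nrow = 20,
            dimnames = list(NULL, paste0("V", 1:4)))
pc <- princomp(X, cor = TRUE)

# Standardize with the same center and scale princomp used internally
# (princomp divides by n, not n - 1, when computing these sds):
Z <- scale(X, center = pc$center, scale = pc$scale)

# Substituting the standardized variables into each component's
# equation (Z times the loadings) reproduces the reported scores:
manual <- Z %*% loadings(pc)
max(abs(manual - pc$scores))  # essentially zero
```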

```
Component variances:
    Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7     Comp.8
2.68616770 1.86196171 1.11240271 1.03044187 0.61920451 0.36727816 0.27738461 0.04515874
```

**Interpretation of results (Component variances):**

These are known as **eigenvalues**. The value for each component is the square of its standard deviation (see the next block of output). Their sum is 8, since there are 8 principal components and the data are standardized.

```
Importance of components:
                         Comp.1    Comp.2    Comp.3    Comp.4     Comp.5     Comp.6     Comp.7      Comp.8
Standard deviation     1.638953 1.3645372 1.0547050 1.0151068 0.78689549 0.60603478 0.52667315 0.212505858
Proportion of Variance 0.335771 0.2327452 0.1390503 0.1288052 0.07740056 0.04590977 0.03467308 0.005644842
Cumulative Proportion  0.335771 0.5685162 0.7075665 0.8363717 0.91377231 0.95968208 0.99435516 1.000000000
```

**Interpretation of results (Importance of components):**

- Standard deviation: the standard deviation of each principal component, computed from the scores obtained by substituting each newborn's values into the component's equation.
- Proportion of Variance: the proportion of the variance explained by each principal component. The proportions sum to 1. This row is the key one for our results.
- Cumulative Proportion: the running total of those proportions, added up component by component.
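All three rows can be rebuilt by hand from a fitted `princomp` object. A sketch with synthetic data (random values, not the newborn data):

```r
set.seed(3)
X <- matrix(rnorm(30 * 8), nrow = 30)
pc <- princomp(X, cor = TRUE)

sd_comp     <- pc$sdev                         # "Standard deviation" row
eigenvalues <- sd_comp^2                       # the component variances
prop        <- eigenvalues / sum(eigenvalues)  # "Proportion of Variance"
cum         <- cumsum(prop)                    # "Cumulative Proportion"

sum(eigenvalues)  # 8: with standardized data the variances sum to
                  # the number of variables
cum[8]            # 1: the 8 components together explain everything
```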


**We note that the first two components group 56.9% of the variance; in other words, 43.1% of the variation remains unexplained. So, how many components should we keep?**

## 2.3. How many components do we keep?

An informal method to decide how many components to keep relies on choosing the principal components which together…

- … explain more than 70% of the total variation, and…
- … if they come from standardized data, have associated eigenvalues greater than 1.

The eigenvalues of each principal component can be displayed graphically. This chart is known as a **scree diagram** (sedimentation plot). In **R Commander**:

The chart showing the **eigenvalues** of each principal component is as follows:
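In plain R, the same plot comes from `screeplot()`. A sketch with synthetic data (random values, not the newborn data), with a dashed reference line at the eigenvalue = 1 cut-off:

```r
set.seed(3)
X <- matrix(rnorm(30 * 8), nrow = 30)
pc <- princomp(X, cor = TRUE)

screeplot(pc, type = "lines", main = "Scree diagram")
abline(h = 1, lty = 2)  # the eigenvalue > 1 cut-off
```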

Copying part of the previous results, note that the first 3 components group **70.8%** of the variation…

```
Importance of components:
                         Comp.1    Comp.2    Comp.3    Comp.4     Comp.5     Comp.6     Comp.7      Comp.8
Standard deviation     1.638953 1.3645372 1.0547050 1.0151068 0.78689549 0.60603478 0.52667315 0.212505858
Proportion of Variance 0.335771 0.2327452 0.1390503 0.1288052 0.07740056 0.04590977 0.03467308 0.005644842
Cumulative Proportion  0.335771 0.5685162 0.7075665 0.8363717 0.91377231 0.95968208 0.99435516 1.000000000
```

… and up to the **fourth principal component** the **eigenvalues** are greater than 1:

```
Component variances:
    Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7     Comp.8
2.68616770 1.86196171 1.11240271 1.03044187 0.61920451 0.36727816 0.27738461 0.04515874
```

We therefore keep the first 4 principal components.
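Both informal criteria are easy to apply in code. A sketch on synthetic data (random values, not the newborn data):

```r
set.seed(3)
X <- matrix(rnorm(30 * 8), nrow = 30)
pc <- princomp(X, cor = TRUE)

eig <- pc$sdev^2
cum <- cumsum(eig / sum(eig))

k_var <- which(cum > 0.70)[1]  # first component crossing 70% cumulative
k_eig <- sum(eig > 1)          # components with eigenvalue > 1
c(k_var = k_var, k_eig = k_eig)
```

For the newborn data these two counts come out as 3 and 4 respectively, which is why the text settles on 4 components.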

## 2.4. How do we interpret all this in our example?

We have finally decided to keep the first 4 principal components, because they are the ones that meet the requirements set out in the preceding section. But… what does this mean? We copy the coefficients of the component equations and look for the largest absolute values in each:

```
Component loadings:
        Comp.1     Comp.2      Comp.3      Comp.4
V1  0.45927632 -0.3300700  0.42221491 -0.06912846
V2 -0.42975751 -0.2662236  0.08772638 -0.32688585
V3 -0.06900791 -0.6387247 -0.16726543  0.11575498
V4  0.03880912 -0.5597814 -0.41999034  0.27503283
V5  0.48981256 -0.1997337  0.44528178 -0.07725784
V6  0.38390014  0.1163544 -0.37150886  0.14717175
V7 -0.43872101 -0.1867023  0.42200008  0.02548362
V8 -0.13540205  0.1027849  0.30869124  0.87809277
```

- **Component 1** gathers 33.6% of the variation, and the largest absolute values among its coefficients are those of V1, V2, V5 and V7. This means that these variables contribute most to this variation. In our example, they correspond to the **ages of the parents** **and the number of cigarettes they smoke daily**. They therefore seem to be related to children being born with low weight.
- **Component 2** gathers 23.3% of the variation, and the largest absolute values among its coefficients are those of V3 and V4, which correspond to the **mother's build** (height and weight).
- The other two components are somewhat harder to interpret, but that is not necessary to carry out a proper principal component analysis.

### 2.4.1. Graphical representation of the first two components

Going back to the code executed at the start of the analysis (copied below), we copy the `princomp()` line and take it as our starting point, removing the leading dot from `.PC`…

```
> local({
+   .PC <- princomp(~V1+V2+V3+V4+V5+V6+V7+V8, cor=TRUE, data=Datos)
+   cat("\nComponent loadings:\n")
+   print(unclass(loadings(.PC)))
+   cat("\nComponent variances:\n")
+   print(.PC$sd^2)
+   cat("\n")
+   print(summary(.PC))
+ })
```

… so that, adding **biplot(PC)** underneath, it reads:

```
PC <- princomp(~V1+V2+V3+V4+V5+V6+V7+V8, cor=TRUE, data=Datos)
biplot(PC)
```

We copy this and paste it into **R Commander** (in the **R Script** tab), select both lines and click **Run**. The result is as follows:

We see that the horizontal axis represents principal component 1 and the vertical axis principal component 2.

The longer a red arrow, the higher the value of that variable's coefficient in that component. Note that this is exactly the graphical counterpart of what we observed in section 2.4.

For example, in principal component 1 the longest arrows are V5 and V1 (positive) and V7 and V2 (negative), which correspond to the highest coefficients obtained in section 2.4 for that component. Principal component 2, in turn, shows two variables (V3 and V4) with the greatest length, which again correspond to the two variables identified in section 2.4.

## 2.5. Add the main components to the data set

We have decided to keep the **first 4 principal components**. How do we add them to the data set to continue the analysis with them?

The following window appears, in which we indicate that we want 4 components:

In **View data set**, you should now see the 4 components added after the initial 8 variables:
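In plain R, the same result comes from taking the scores of the fitted object and binding them onto the data frame. A sketch with a synthetic stand-in (random values, not the newborn data; R Commander names the new columns PC1, PC2, …):

```r
set.seed(3)
Datos <- as.data.frame(matrix(rnorm(30 * 8), nrow = 30,
                              dimnames = list(NULL, paste0("V", 1:8))))
pc <- princomp(Datos, cor = TRUE)

# Append the first 4 component scores to the data set:
scores <- as.data.frame(pc$scores[, 1:4])
names(scores) <- paste0("PC", 1:4)
Datos <- cbind(Datos, scores)

ncol(Datos)  # 12: the 8 original variables plus the 4 components
```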

## 2.6. Reducing the number of variables: the goal of principal component analysis

If, instead of generating new variables (as in the previous case), we want to discard the original variables that carry the least information, some authors have proposed methods to do so.

Jolliffe (1972, 1973) proposes reducing the original variables directly, keeping those that meet the following requirements:

- Choose the principal components whose **eigenvalues are greater than 0.7**.
- For each selected component, select the variable **with the largest absolute coefficient** (that has not been selected previously).

In our example, we have 4 principal components whose eigenvalues are above 0.7 (see above). The eigenvectors of each component are shown below:

```
Component loadings:
        Comp.1     Comp.2      Comp.3      Comp.4
V1  0.45927632 -0.3300700  0.42221491 -0.06912846
V2 -0.42975751 -0.2662236  0.08772638 -0.32688585
V3 -0.06900791 -0.6387247 -0.16726543  0.11575498
V4  0.03880912 -0.5597814 -0.41999034  0.27503283
V5  0.48981256 -0.1997337  0.44528178 -0.07725784
V6  0.38390014  0.1163544 -0.37150886  0.14717175
V7 -0.43872101 -0.1867023  0.42200008  0.02548362
V8 -0.13540205  0.1027849  0.30869124  0.87809277
```

- In Comp. 1 the highest absolute value among its coefficients (0.4898) corresponds to V5.
- In Comp. 2 it corresponds to V3.
- In Comp. 3 it corresponds to V5, but as V5 was already selected in Comp. 1, we select the next largest, V1.
- In Comp. 4 it corresponds to V8.

We are therefore left exclusively with variables **V1, V3, V5 and V8**.
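This selection rule is mechanical enough to script. A sketch of Jolliffe's procedure on synthetic data (random values, not the newborn data):

```r
set.seed(3)
X <- as.data.frame(matrix(rnorm(30 * 8), nrow = 30,
                          dimnames = list(NULL, paste0("V", 1:8))))
pc <- princomp(X, cor = TRUE)

keep <- which(pc$sdev^2 > 0.7)  # components with eigenvalue > 0.7
L <- unclass(loadings(pc))

selected <- character(0)
for (j in keep) {
  # variable with the largest absolute coefficient in component j
  # that has not been selected for an earlier component:
  ranked <- names(sort(abs(L[, j]), decreasing = TRUE))
  selected <- c(selected, ranked[!(ranked %in% selected)][1])
}
selected  # one variable kept per retained component
```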

**Long live free Software!**

### References

- García Pérez, A., 2005. Advanced Methods of Applied Statistics: Advanced Techniques, 1st ed. Universidad Nacional de Educación a Distancia, Madrid.