---
title: "Lab: Scale Development and Factor Analysis"
author: "PHS2000B"
date: "1 May 2023"

output: 
    pdf_document :
      highlight: kate
      toc: true
      number_sections: yes
fontsize: 12 pt
header-includes:
- \usepackage{bm}
---
\newcommand\independent{\perp\!\!\!\perp}

\newcommand{\E}{{\operatorname E}}
\newcommand{\Var}{{\operatorname {Var}}}
\newcommand{\Cov}{{\operatorname {Cov}}}
\newcommand{\Cor}{{\operatorname {corr}}}

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = T)
```

```{r packages, message=FALSE, eval = T, include=FALSE}
if (!require("psych")) install.packages("psych")
if (!require("GPArotation")) install.packages("GPArotation")
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("gridExtra")) install.packages("gridExtra")
if (!require("kableExtra")) install.packages("kableExtra")
if (!require("reshape2")) install.packages("reshape2")

library(psych)
library(GPArotation)
library(ggplot2)
library(gridExtra)
library(pander)
library(reshape2)
library(knitr)
library(kableExtra)

options(knitr.table.format = "latex")
theme_update(plot.title = element_text(hjust = 0.5))

cesd <- read.csv("nhanes1_cesd.csv")
cesd <- cesd[complete.cases(cesd), 4:23]
names(cesd) <- c("bothered", "appetite", "blues", "good", "mind", "depressed", "effort", "hopeful", "failure", "fearful", "sleep", "happy", "talk", "lonely", "unfriendly", "enjoy", "cry", "sad", "dislike", "getgoing")
```

*Acknowledgment: Thanks to Louisa Smith for extensive work on an earlier version of the exploratory data analysis and factor analysis materials.*

# The Centers for Epidemiologic Studies Depression inventory (CES-D)

Today we will be using data from the CES-D (Centers for Epidemiologic Studies Depression inventory)^[Radloff LS. The CES-D scale: a self-report depression scale for research in the general population. Applied Psychological Measurement. 1977;1:385-401.] from the National Health and Nutrition Examination Survey I (1971-1974). This self-report depression scale was developed to assess depressive symptoms in the general population, and has found to be both reliable and valid in a nubmer of populations.^[A lot of information and references about the CES-D can be found here: https://www.sralab.org/rehabilitation-measures/center-epidemiological-studies-depression-scale-ces-d]

However, there are a lot of ways that depression can manifest, and two people who have equally strong depression could have symptoms along different dimensions, such as loss of interest in activities, feeling sad, or experiencing psychosomatic pain. We will be using responses to the 20 questions that make up the CES-D to illustrate the exploratory analyses you will be conducting for your Measurement Project, including using factor analysis to explore the underlying dimensions of the scale.

We'll restrict our analysis to white women and use only those observation with complete data on all 20 of the items. These data are included in the file `nhanes1_cesd.csv`. 

The 20 questions ask about the subject's experiences in the past week and are scored on a four point scale:

* 0 = Rarely or none of the time (less than 1 day)
* 1 = Some or a little of the time (1-2 days)
* 2 = Occasionally or a moderate amount of time (3-4 days)
* 3 = Most or all of the time (5-7 days)

The items include:

1. I was bothered by things that usually don't bother me.
2. I did not feel like eating; my appetite was poor.
3. I felt that I could not shake off the blues even with help from my family or friends.
4. I felt that I was just as good as other people.
5. I had trouble keeping my mind on what I was doing.
6. I felt depressed.
7. I felt that everything I did was an effort.
8. I felt hopeful about the future.
9. I thought my life had been a failure.
10. I felt fearful.
11. My sleep was restless.
12. I was happy.
13. I talked less than usual.
14. I felt lonely.
15. People were unfriendly.
16. I enjoyed life.
17. I had crying spells.
18. I felt sad.
19. I felt that people disliked me.
20. I could not get going.

# Scale Development

Our goal for the measurement project is to develop a scale that measures a latent construct. To accomplish this, you have generated an item pool, reviewed the items for content, and will be collecting empirical data from a sample of respondents.

Once you have the data collected, we've asked you to compute:

1. item means
2. item variances (or standard deviations)
3. the correlation matrix of all of the items
4. run an exploratory factor analysis to explore the underlying factor structure of your instrument
5. reliability (internal consistency)
6. uncorrected and corrected item-scale correlations

We'll illustrate each of these steps below, making use of the functions in the `psych` package. Note that there is a lot of useful information and code in this document and we don't expect you to absorb it all in one sitting. As you work on your measurement project data analysis, refer back to this document and feel free to bring your questions to class or office hours or to post on the discussion board.

\newpage
## Exploring item means, standard deviations, and distributions

```{r results="asis"}
# Here is a quick function to output nobs, mean, std, min, and max
f.summarize <- function(x){
data.frame(obs=apply(x,2,length),mean=apply(x,2,mean),
           std=apply(x,2,sd), min=apply(x,2,min), max=apply(x,2,max))
}

kable(f.summarize(cesd), digits = 3, 
      caption = "Mean, standard deviation, minimum, and maximum of CES-D items",
      booktabs = T, longtable = T)

```

* Why might we want to look at item means and standard deviations?

It may also be helpful to visualize the distribution of responses to each item.

```{r}
# use ggplot to output summary histograms of all CES-D items.
# note: lapply repeatedly calls ggplot to generate the plots
# marrangeGrob from the gridExtra package lays out the plots
plot.items <- lapply(names(cesd)[1:20],
       function(n)
         ggplot(data=cesd, aes_string(x=n)) +
         geom_bar())
ml <- marrangeGrob(plot.items, nrow=5, ncol=4)
ml
```

## Computing and visualizing the correlation matrix

Now let's look at the correlation matrix for the items. This should give you some idea of whether your items are measuring the same thing (though it won't tell you if you're measuring the right thing!).
```{r, results = 'asis'}
cor_mat <- cor(cesd)
rownames(cor_mat) <- colnames(cor_mat) <- names(cesd)

panderOptions("round",3)
pander(cor_mat, caption="CES-D item correlation matrix")
```

The full correlation matrix of all 20 items is hard to absorb. Sometimes it's more informative to look at a graphical depiction of the correlation matrix. There are many ways to do this; here's one:

```{r, tidy = F}
# This function turns the correlation matrix into a "long" dataset for use with ggplot
cor_dat <- melt(cor_mat, value.name = "Correlation")

# We are going to use the cor_dat dataframe, with Var1 on the x-axis and 
# Var2 on the y-axis, and we are going to fill in the colors according 
# to the correlation values
ggplot(cor_dat, aes(Var1, Var2, fill = Correlation)) + 
# Each data point is going to be portrayed as a colored tile  
  geom_tile() + 
# This function helps make pretty colors  
  scale_fill_distiller(type = "seq", palette = "Blues", direction=1) + 
# A bunch of formatting things I always have to Google
  theme(axis.title = element_blank(), panel.background = element_blank(), 
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
        axis.text.x = element_text(angle = 45, hjust = 1.1, vjust = 1.1), 
        axis.ticks = element_blank())
```

* What patterns can you see in the visual display of the correlation matrix?

# Factor analysis

We can think of factor analysis as giving us a formal way to characterize the patterns in the correlation matrix.

## Factor analytic model

We'll define $\mathbf{X}_i$ as the vector of manifest variables: a set of $p$ observable items corresponding to individual $i$.^[For quick reference: $i$ will index individuals: $i = 1,\ldots, n$; $j$ will index items: $j = 1, \ldots, p$; $k$ will index factors: $k = 1, \ldots, q$] 
$$\mathbf{X}_i = (X_{i1}, X_{i2}, \ldots, X_{ip})^T$$
In our example, these are the values assigned to the answers "most or all of the time" (3), "occasionally" (2), and "some or a little of the time" (1), and "rarely" (0) in response to the series of $p = 20$ questions asked of each participant.

Each of the $X_{ij}$ has a mean value, $\mu_j$, which we don't really care about (we only care about the *relationship* between our manifest variables and the underlying factors). We can center $\mathbf{X}_i$ by subtracting the item-specific means, so that we have $\mathbf{X}_i - \bm{\mu}$, which is still a vector of length $p$:
$$\mathbf{X}_i - \bm{\mu} = (X_{i1} - \mu_{1}, X_{i2} - \mu_{2}, \ldots, X_{ip} - \mu_{p})^T$$
On the left-hand side we now have the item scores relative to their means -- in other words, whether a person's response was exceptionally high, low, or close to average.

Now let's assume that each of these centered, observable random variables is some linear combination of unobservable random variables. In particular, each contains some (possibly zero-valued) contribution from each of the $q$ latent factors ($q < p$), along with some added random error. 

In our example, those factors may reflect latent constructs such as depressed affect, interpersonal functioning, etc. We may or may not know what we expect these factors should be.^[*Exploratory* factor analysis involves trying to determine the factor structure from the observed data, while *confirmatory* factor analysis involves testing how well a pre-defined factor structure fits the data. We're focusing on exploratory factor analysis here.]

For each item $X_{ij}$, $j = 1, \ldots, p$, for individual $i$ (that is, for each item in the above vector), this looks like:
$$X_{ij} - \mu_j = \lambda_{j1}F_{i1} + \lambda_{j2}F_{i2} + \ldots + \lambda_{jq}F_{ij}  + e_{ij}$$
So we assume that the $j$th (centered) item for person $i$ is made up of a linear combination of their $q$ factors $\mathbf{F}_i$ (a vector of length $q$, specific to individual $i$) and the associated loadings $\bm{\lambda}_j$ (also a vector of length $q$, but it will be the same for every $i$ -- it's a constant), plus a random error.

Now you can see why matrix notation might be useful! We can represent this model for each of the $p$ items in the following form, where each of those mean-centered items is a row:

$$
\begin{bmatrix}
X_{i1} \\
X_{i2} \\
\vdots \\
X_{ip}
\end{bmatrix}
- 
\begin{bmatrix}
\mu_{1} \\
\mu_{2} \\
\vdots \\
\mu_{q}
\end{bmatrix}
=
\begin{bmatrix}
\lambda_{11} & \lambda_{12} & ... & \lambda_{1q}\\
\lambda_{21} & \lambda_{22} & ... & \lambda_{2q}\\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{p1} & \lambda_{p2} & ... & \lambda_{pq}
\end{bmatrix}
\begin{bmatrix}
F_{i1} \\
F_{i2} \\
\vdots \\
F_{iq}
\end{bmatrix} + 
\begin{bmatrix}
e_{i1} \\
e_{i2} \\
\vdots \\
e_{ip}
\end{bmatrix}
$$
which corresponds to:
$$\mathbf{X}_{p\times 1} - \bm{\mu}_{p\times 1} = \bm{\Lambda}_{p\times q}\mathbf{F}_{q\times 1} + \mathbf{e}_{p\times 1}$$
Note that on both sides of the equation we end up with a column vector of length $p$.^[If we wanted to represent this model in terms of the observed data, with its $n$ observations, we could instead have a $p \times n$ matrix on each side of the equation. Recall that we can add and subtract matrices as long as their dimensions match, and we can multiply matrices as long as their inner dimensions match, which produces a matrix with dimension equal to their outer dimensions. If you would like some matrix algebra practice, I encourage you to write out the matrix operations above and show that each entry of the matrix we would have on the right-hand side is of the form above: $\lambda_{j1}F_{i1} + \lambda_{j2}F_{i2} + \ldots + \lambda_{jq}F_{iq}  + e_{ij}$.]

Writing the model out like this helps us see which components of the model are random variables and which are fixed constants:
$$\mathbf{X}_i - \bm{\mu} = \bm{\Lambda}\mathbf{F}_i + \mathbf{e}_i$$

This looks somewhat like the setup for other regression models you may have seen. However, note that we don't observe *anything* on the right-hand side of this model! And yet our goal is to estimate the parameters $\bm{\Lambda}$ -- the loadings that link each factor to each item.

We can take the covariance of both sides:
$$\Cov(\mathbf{X}_i - \bm{\mu}) = \Cov(\bm{\Lambda}\mathbf{F}_i + \mathbf{e})$$
First we can replace $\Cov(\mathbf{X}_i - \bm{\mu})$ with $\bm{\Sigma}$, the covariance matrix of $\mathbf{X}_i$, since $\bm{\mu}$ is a constant vector.

Since we are assuming $\mathbf{F} \independent \mathbf{e}$, we can split the sum on the right-hand side. Then we can pull out the loadings from $\Cov(\bm{\Lambda}\mathbf{F}_i)$, since it is a matrix of constants, not random variables.^[We could also write $\Cov(\bm{\Lambda}\mathbf{F}_i)$ as $\Var(\bm{\Lambda}\mathbf{F}_i)$; they are equivalent. In both cases, since we are working with a multivariable $\mathbf{F}_i$, we are referring to its variance-covariance matrix. Recall that we can pull out a constant from the variance of a single random variable, $\Var(aX) = a^2\Var(X)$, and from the covariance $\Cov(aX, bY) = ab\Cov(X, Y)$. When we're dealing with multivariable $\mathbf{F}_i$, we have both situations (the diagonal of a covariance matrix is composed of the variances of each individual $F$). But we don't observe $\mathbf{F}_i$ directly; instead, we observe $p$ linear combinations of the $q$-dimensional $\mathbf{F}_i$. We are representing that transformation with the $p\times q$ matrix $\bm{\Lambda}$. In such a situation, we can factor out the constants from the variance-covariance matrix of a multivariable random variable $\mathbf{X}$ by writing $\Cov(\mathbf{AX}) = \mathbf{A}\Cov(\mathbf{X})\mathbf{A}^T$.]

Now we have
$$\bm{\Sigma} = \bm{\Lambda}\Cov(\mathbf{F}_i)\bm{\Lambda}^T + \Cov(\mathbf{e}_i)$$
The next assumption we need to invoke is that $\Cov(\mathbf{F}_i) = \mathbf{I}$. That is, we assume that the factors are uncorrelated (we will loosen this assumption when we do *oblique rotation* below). This means that $\Cov(\mathbf{F}_i)$ falls out -- very conveniently, since we can't observe $\mathbf{F}_i$! We also assume that the errors are independent with a diagonal variance-covariance matrix $\bm{\Psi}$, leaving us with:
$$\bm{\Sigma} = \bm{\Lambda}\bm{\Lambda}^T + \bm{\Psi}$$
We can observe values that allow us to estimate the left-hand side. Factor analysis is all about trying to partition the observed variance-covariance matrix of the items into the matrix for the loadings, and the diagonal matrix representing the contributions of the random error.


## Model fitting: The principal axis approach

We're going to walk through one way to fit this model using the principal axis approach (also called principal factor analysis). This is primarily of interest as a heuristic for helping us understand how we go about decomposing the correlation matrix of the manifest variables. 

### Eigendecomposition

Let's assume all the observed variables are standardized to have variance 1, so the covariance matrix is equal to the correlation matrix ($\Cor(\mathbf{X}_i) = \Cov(\mathbf{X}_i) = \bm{\Sigma}$). We'll be using the correlation matrix to illustrate how the $\bm{\Lambda}$ matrix can be estimated.

When we have a symmetric matrix like the correlation matrix (remember that $\Cor(X, Y)$ = $\Cor(Y, X)$, so all elements on opposite sides of the diagonal match), something called the spectral theorem tells us that we can always *diagonalize* it.

What this means for the correlation matrix $\bm{\Sigma}$ is that there is some matrix $\mathbf{A}$ and some diagonal matrix (which means that the only non-zero elements are on the diagonal) $\bm{\Delta}$ such that we can write
$$\bm{\Sigma} = \mathbf{A}\bm{\Delta}\mathbf{A}^{T}$$
This is called the *eigendecomposition*: the *eigenvectors* make up the columns of matrix $\mathbf{A}$ and the *eigenvalues* are the elements on the diagonal of $\bm{\Delta}$.

What is an eigenvalue? When we multiply a vector by a matrix, the vector is transformed in some way. For a given matrix, some vectors are just scaled by a factor. Those pairs of vectors and their scaling factors are called eigenvectors and eigenvalues. For our purposes, think of a matrix as having a certain amount of information in it. When we do eigendecomposition, an eigenvalue represent how much information is explained by its corresponding eigenvector -- larger ones explain more; those near 0 mean there's little left over to explain after taking into account the other eigenvectors. If an eigenvalue is 0, that means its eigenvector is redundant: it's just a linear combination of the other eigenvectors.

Since $\bm{\Delta}$ is a diagonal matrix, it looks like this:
$$
\bm{\Delta} = 
\begin{bmatrix}
\delta_1 & 0 & ... & 0\\
0 & \delta_2 & ... & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & ... & \delta_p
\end{bmatrix} =
\begin{bmatrix}
\sqrt{\delta_1} & 0 & ... & 0\\
0 & \sqrt{\delta_2} & ... & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & ... & \sqrt{\delta_p}
\end{bmatrix}
\begin{bmatrix}
\sqrt{\delta_1} & 0 & ... & 0\\
0 & \sqrt{\delta_2} & ... & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & ... & \sqrt{\delta_p}
\end{bmatrix}
$$
That is, we can split a diagonal matrix into the product of the diagonal matrix of the square roots of its elements -- the eigenvalues.

Therefore, we can rewrite 
$$\bm{\Sigma} = \mathbf{A}\bm{\Delta}\mathbf{A}^{T} = \mathbf{A}\sqrt{\bm{\Delta}}\sqrt{\bm{\Delta}}\mathbf{A}^{T}$$
Since we are post-multiplying $\mathbf{A}$ with $\sqrt{\bm{\Delta}}$ and pre-multiplying $\mathbf{A}^T$ with $\sqrt{\bm{\Delta}}$, we conveniently end up with a matrix and its transpose, which we'll call $\bm{\Lambda}^*$:
$$\bm{\Sigma} = \mathbf{A}\sqrt{\bm{\Delta}}\sqrt{\bm{\Delta}}\mathbf{A}^{T} = \bm{\Lambda}^*\bm{\Lambda}^{*T}$$

This looks almost exactly like what we have on the right-hand side of the factor analytic model! All that's left to do is pull out the uniquenesses.

### Moving from $\bm{\Lambda}^*\bm{\Lambda}^{*T}$ to $\bm{\Lambda}\bm{\Lambda}^{T} + \bm{\Psi}$

We'll walk through an iterative approach to estimate the factor loadings. Our goal is to move from $\bm{\Lambda}^*\bm{\Lambda}^{*T}$ to $\bm{\Lambda}\bm{\Lambda}^{T} + \bm{\Psi}$. To do this, first note that $\bm{\Lambda}^*$ is of dimension $p$, because we perform the eigendecomposition on the correlation matrix of the observed variables, of which there are $p$. However, our goal is to reduce its dimensionality to $q$: we assume that there are fewer factors than there are items, so that the same factor contributes to multiple items.

We'll discuss later how to choose the number of factors, but right now assume $q = 4$.

First, we'll use the `eigen()` function in `R` to perform the eigendecomposition.

```{r}
sigma <- cor(cesd)
eigenv <- eigen(sigma, symmetric = TRUE)
```

The resulting object contains the eigenvectors (`eigenv$vectors`) and eigenvalues (`eigenv$values`).

We can create the $\bm{\Lambda}^*$ matrix by multiplying the matrix of eigenvectors by the square root of the eigenvalues (that is, $\bm{\Lambda}^* = \mathbf{A}\sqrt{\bm{\Delta}}$).
```{r}
lambda_star <- eigenv$vectors %*% sqrt(diag(eigenv$values))
```

We can confirm that $\bm{\Lambda}^*\bm{\Lambda}^{*T} = \bm{\Sigma}$:
```{r}
(lambda_star %*% t(lambda_star))[1:5, 1:5]
sigma[1:5, 1:5]
```

However, this doesn't really help us: we need to take our first step toward pulling out $\bm{\Psi}$ by reducing the dimensionality of $\bm{\Lambda}^*$. Instead of using all of the eigenvalues to create $\bm{\Lambda}^*$, we will only use the first $q$. The eigenvalues are ordered by size -- in essence, how much of the variability in the data they represent -- so the first $q$ are the most important. We won't be losing that much information about the data by representing it with only the top $q$ eigenvalues (this is also the idea behind the dimension-reduction method of principal components analysis).

```{r}
q <- 4
lambda_star <- eigenv$vectors[,1:q] %*% sqrt(diag(eigenv$values[1:q]))
```

Note that $\bm{\Lambda}^*$ is now a $p \times q$ matrix, whereas before it was $p \times p$ (this is why we need $q < p$: we can't *add* information to our observed data, only take it away).

Not only are we reducing the dimension of $\bm{\Lambda}^*$, we also need to get rid of the part that belongs to $\bm{\Psi}$. Recall that $\bm{\Psi}$ is a diagonal matrix.

We call the diagonal elements of $\bm{\Lambda}\bm{\Lambda}^T$ the **communalities**. The communality for a given item, $h^2_j$, represents the total proportion of the variance of each $x_j$ that is explained by the factors (you can think of this as something like the $\alpha$ we estimated for tests for single latent variable, or $R^2$ in linear regression). 

We calculate the communality as the sum of the squared loadings for that item:
$$h^2_j = \sum_{k = 1}^q\lambda^2_{jk}$$
The rest of the variance of the item is assumed to be random error: the part contributed by $\bm{\Psi}$. We call the diagonal elements of $\bm{\Psi}$ the **uniquenesses**. The uniquenesses must be what we get when we subtract the communalities from the diagonal of the correlation matrix, of which each element is 1:
$$u_j^2 = 1 - h_j^2$$
To estimate the communalities, we are going to "recreate" the correlation matrix as the product of our new $\bm{\Lambda}^*$ and its transpose:
$$\bm{\Sigma}^* = \bm{\Lambda}^*\bm{\Lambda}^{*T}$$

```{r}
sigma_star <- lambda_star %*% t(lambda_star)
dim(sigma_star)
```
Like the original correlation matrix $\bm{\Sigma}$, $\bm{\Sigma}^*$ is still $p \times p$. However, since we're only calculating it from $q$ eigenvalues, we haven't explained all of the variability in the data -- we are assuming the rest, which is not explained by the $q$ factors, is the random error. Therefore, the diagonal elements of this new matrix $\bm{\Sigma}^*$ represent the communalities.^[Try writing out the matrix multiplication of the loading matrix $\bm{\Lambda}^*\bm{\Lambda}^{*T}$ to show that each element of the diagonal corresponds to $h^2_j$.]

We are going to replace the diagonal of the old correlation matrix $\bm{\Sigma}$ (which used to be all 1's) with the communalities from the new $\bm{\Sigma}^*$ matrix:
```{r}
diag(sigma) <- diag(sigma_star)
```
That way, the difference between the real correlation matrix and the one we created with $\bm{\Lambda}^*\bm{\Lambda}^{*T}$ will be our estimate of $\bm{\Psi}$.

### The iterative process
We're not done. We need to iterate through this process until in converges. We'll use `R` to repeat the process (starting from the beginning) until the communalities stay consistent from one iteration to the next.

```{r}
# start off with the original correlation matrix
sigma_star_old <- cor(cesd)
# arbitrary starting values for the communalities
h_new <- 100
h_old <- 0

# this tells R to repeat what's in the brackets until there's a very, very small
# difference in between the communalities from one iteration (h_old) and the next (h_new)
while(abs(h_new) - abs(h_old) > 0.00001){
  
  # first we calculate the communalities from the current sigma_star
  h_old <- sum(diag(sigma_star_old))
  
  # we do an eigendecomposition
  eigen_old <- eigen(sigma_star_old, symmetric = TRUE)
  
  # and calculate new values for lambda from the first q eigenvalues
  lambda <- eigen_old$vectors[,1:q] %*% sqrt(diag(eigen_old$values[1:q]))
  
  # use the new lambda to calculate a new sigma_star
  sigma_star_new <- lambda %*% t(lambda)
  
  # replace the old diagonal of sigma_star with the new diagonal
  diag(sigma_star_old) <- diag(sigma_star_new)
  
  # calculate the new communalities
  h_new <- sum(diag(sigma_star_new))
}
```

Now we should have a stable estimate of $\bm{\Lambda}\bm{\Lambda}^T$, saved as `sigma_star_old`. We can confirm that the 1's on the diagonal of the correlation matrix (the total variance of each item) have been replaced with values \textless 1 -- the communalities -- representing the proportion of the variance explained by the factors. Whatever is left over is our estimate of $\bm{\Psi}$!

```{r}
cor(cesd)[1:5, 1:5]
sigma_star_old[1:5, 1:5]
```

However, we don't really care about $\bm{\Psi}$. We can look at the values of `lambda`, which is the $p \times q$ matrix of the factor loadings:
```{r}
dim(lambda)
lambda[1:5,]
```

These aren't very interpretable as they are (we'll use built-in `R` functions below to fit rotated models, which are easier to interpret), but each value $\lambda_{jk}$ is the "amount" that a factor $k$ contributes to an item $j$.

## What just happened?
To review, we just used the following algorithm to estimate the factor loadings:

Using $m$ to index iterations of this algorithm, we:

1. Begin with the eigendecomposition of the original correlation matrix $\bm{\Sigma}$,
which we will designate $\bm{\Sigma}_{(0)}$.

2. Obtain current estimates of the factor loadings, $\bm{\Lambda}_{(m)}$ from the first $q$ eigenvectors and eigenvalues.

3. Use the current estimates of the factor loadings to predict the correlation matrix
\[ \bm{\Sigma}^{*}_{(m)} = \bm{\Lambda}_{(m)}\bm{\Lambda}_{(m)}^T \]

4. Replace the diagonal of $\bm{\Sigma}_{(0)}$ with the diagonal of $\bm{\Sigma}^{*}_{(m)}$ to get a new $\bm{\Sigma}_{(m)}$.

5. Perform the eigendecomposition of $\bm{\Sigma}_{(m)}$.

6. Repeat steps 2-5 until the change in the sum of the communalities $\left[\mbox{trace}(\bm{\Sigma}^{*}_{(m)})-\mbox{trace}(\bm{\Sigma}^{*}_{(m-1)})\right]$ is negligible.^[The trace of a square matrix is the sum of its diagonal elements.]

# Model fitting: other approaches

There are other approaches to fitting factor analytic models -- we don't always need to do it by hand (and we probably don't want to)! These approaches also make it easy to calculate the best rotation for the factor loadings (see below).

## Maximum likelihood factor analysis
The maximum likelihood solution finds those communality values that minimize the chi square goodness of fit test. The `fa()` function in the `psych` package can perform maximum likelihood factor analysis if we request `fm = "ml"`. Note that there is also a function called `factanal` included in the `stats` package in base `R` that fits maximum likelihood factor analysis.

The code below shows three different types of rotation using the maximum likelihood approach. (What we did by hand corresponds to `rotate = "none"`.) We'll discuss what rotation does in a bit.
```{r factorml}
# FA by maximum likelihood, no rotation (factors are orthogonal)
cesd.ml.none <- fa(cor(cesd), nfactors = q, n.obs = nrow(cesd), 
                   fm = "ml", rotate = "none")

# FA by maximum likelihood, varimax rotation (factors are orthogonal)
cesd.ml.varimax <- fa(cor(cesd), nfactors = q, n.obs = nrow(cesd), 
                      fm = "ml", rotate = "varimax")

# FA by maximum likelihood, oblimin rotation (factors are oblique)
cesd.ml.oblimin <- fa(cor(cesd), nfactors = q, n.obs = nrow(cesd), 
                      fm = "ml", rotate = "oblimin")
```

## Factor analysis with the `minres` solution
The minimum residual (`minres`) solution is an unweighted least squares solution that uses the `optim` function and adjusts the diagonal elements of the correlation matrix to minimize the squared residual when the factor model is the eigenvalue decomposition of the reduced matrix. The `minres` solution is similar to principal axis factor analysis and both methods will often work when maximum likelihood does not. 

The code below shows the three main types of rotation with this solution.
```{r factorminres}
# FA by minres, no rotation (factors are orthogonal)
cesd.minres.none <- fa(cor(cesd), nfactors = q, fm = "minres", rotate = "none")

# FA by minres, varimax rotation (factors are orthogonal)
cesd.minres.varimax <- fa(cor(cesd), nfactors = q, fm = "minres", rotate = "varimax")

# FA by minres, varimax rotation (factors are orthogonal)
cesd.minres.oblimin <- fa(cor(cesd),nfactors = q, fm = "minres", rotate = "oblimin")
```

# Modeling choices
## Choosing the number of factors
One way to get a sense of how many factors to use is to look at the scree plot, the plot of the eigenvalues. 

```{r scree, fig.height=3.75, fig.align="center", fig.width=3.75}
scree(cesd, factors = F)
```

A somewhat old-fashioned way to determine the number of factors is to use the Kaiser rule, which drops all components with eigenvalues under 1.0. The idea is that we don't want factors that have less information than we would expect the average single item to have, because we are trying to reduce our data.^[The sum of the eigenvalues of a matrix is equal to the trace of the matrix. We started with the correlation matrix, which has 1 as each of its diagonal elements, so its trace is $p$, the number of items. Therefore the eigenvalues will sum to $p$.]). The Kaiser criterion is the default in SPSS and most statistical software but is not recommended when used as the sole cut-off criterion for estimating the number of factors, as it tends to over-extract factors.

Another criterion is to look for the "elbow" in the scree plot and to exclude all components after the component where the "elbow" starts. 

More modern criteria, include Horn's Parallel Analysis (1965) and Velicer's Minimum Average Partial (MAP) criterion, are implemented in the `psych` package. 

```{r parallel, eval = F}
# Horn's Parallel analysis suggests 7 factors
fa.parallel(cor(cesd), n.obs = nrow(cesd), fm = "ml", fa = "fa")

# Velicer's MAP criterion suggests 2 factors
vss(cor(cesd), n.obs = nrow(cesd), fm = "ml")
```

Choosing the number of factors is somewhat of an art \ldots.

## Rotation

### Orthogonal rotation
We noted earlier that the factor loadings were not particularly interpretable. When fitting the model as we did above, we have forced the first factor to explain most of the variability in the items, then the second factor to explain as much as possible of what's left over, and so on. However, that's not really what we expect the factors to do. We still want them to explain as much of the variability in the items as possible, we just don't feel the need to insist that the first factor be so ambitious in its explanatory power. Instead, we can "rotate" the loading matrix so that the items are as closely aligned to a single factor as possible. When we do an *orthogonal* rotation, we maintain the lack of correlation between the factors that we assumed in the model above. We're still explaining the same amount of variation, it's just spread across the factors in a way that minimizes the distance between the factors and the items. This means that the loadings for a given item will be as close to 0 for as many of the factors as possible (and hopefully large for just a single factor).

![](rotation)\

Let's look at the factor loadings before rotation. We'll (somewhat arbitrarily) only print off the loadings above 0.35 to make it easier to read:
```{r}
# before rotation
print(cesd.ml.none$loadings, cut = .35)
```
Note that most items are loading only on the first factor -- we've forced them to do so. Note also the `SS loadings` row: these are the eigenvalues from the $\bm{\Lambda}\bm{\Lambda}^T$ matrix. These are clearly ordered so that the eigenvalue corresponding to the first factor explains by far the most variation in the data (`Proportion Var` row).

Now we'll look at the factor loadings from a solution after an orthogonal rotation:
```{r}
# orthogonal rotation
print(cesd.ml.varimax$loadings, cut = .35)
```
Now we see that the number of items corresponding to each factor is more evenly distributed across the factors. This is also reflected in the fact that each explains a somewhat more equal proportion of the variance.

Now we can start to interpret them. Factor 1 looks like it corresponds to what was described in the original paper as "depressed affect", factor 2 to "positive affect", factor 3 to "somatic and retarded activity", and factor 4 to "interpersonal".^[However, Radloff (1977) noted the high internal consistency of the scale as a whole, and recommended against "undue emphasis on separate factors" for the use of the total score "as an estimate of the degree of depressive symptomatology".]

### Oblique rotation
We could instead assume that the factors are correlated in some way. This may be more realistic -- for example, in the CES-D, we may expect that "positive affect" and "depressed affect" are correlated. However, it also means the covariance matrix for the factors, $\Cov(\mathbf{F}_i)$, doesn't drop out like it did when we worked through the model above. 

![](rotation2)\

That's ok, though! We just have another matrix to estimate: the correlation matrix for the factors. Let's look at the solution with an oblique rotation:

```{r}
# oblique rotation
print(cesd.ml.oblimin$loadings, cut = .35)
```

These item-factor pairs correspond almost exactly to what we saw on the lecture slides.

We can also look at the estimated correlation matrix for the factors:
```{r}
cesd.ml.oblimin$Phi
```
It looks like the factors corresponding to "depressed affect" and "somatic and retarded activity" are highly correlated, while "positive affect" is not very correlated with either "somatic and retarded activity" or "interpersonal".

# Cronbach's alpha: internal consistency reliability

Based on Radloff's (1977) recommendation, let's proceed as though we can treat the total CES-D score as a valid measurement of a single latent construct (depressive symptomatology). We asked you to compute Cronbach's alpha as a measure of internal consistency reliability.

Recall that we can calculate Cronbach's alpha from the covariance matrix by hand. If $Y$ is a scale consisting of $k$ items, $X_1, X_2, \dots, X_k$, then we calculate
$$ \alpha = \frac{k}{k-1} \left[ 1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_y^2}\right] $$
where $\sigma_i^2$ are the item variances and $\sigma_y^2$ is the total variance of the scale (which we can obtain by summing all the elements of the covariance matrix).

Instead of calculating this by hand for the measurement project, however, we can use the `alpha()` function in the `psych` package to compute Cronbach's alpha for us. The `alpha()` function produces a lot of output, so first we'll pull out the Cronbach's alpha statistic that we are interested in.

```{r}
# note that we have to invoke the alpha function using psych::alpha
# because R has other functions called alpha
psych::alpha(cesd)$total[1]
```
* How do you interpret this Cronbach's alpha?

A few helpful notes about the behavior of the `alpha()` function:

* `alpha()` requires non-missing data, so if you should exclude respondents with missing data (or use multiple imputation to fill in the missing data)
* You will generally want to recode reverse-coded items so that they are going the same direction as the other items. You can check the correlation matrix for negative correlations to see whether any items are behaving this way.
* `alpha()` does have an option `check.keys=TRUE` which will ``find the first principal component and reverse key items with negative loadings.'' It will give a warning of this happens.
* While Cronbach's alpha is the most widely reported measure of internal consistency reliability (probably because it is easy to calculate and is implemented in most software packages) it has been noted that it ``underestimates the reliability of a test and over estimates the first factor saturation.'' There are other measure of reliability that have better properties for different situations (see Revelle Chapter 7 for more information).^[o	Revelle W, An introduction to psychometric theory with applications in R (Under development).  Chapter 7:  Reliability. Available at: https://www.personality-project.org/r/book/]

# Raw and corrected item correlations

The `alpha()` function actually produces a lot of useful output that we can examine to see how the individual items are performing. 

The raw and corrected item correlations appear under `Item Statistics` in the output. These are correlations of the individual items with the full scale. The raw item correlations (`raw.r` in the output below) are correlations between individual items and the full scale **including** the individual item, i.e. not correcting for item overlap. The corrected item correlations (`r.drop`) are the correlations of the item with the scale composed of the remaining items.

Also take a look at the output under `Reliability if an item is dropped`. The `raw_alpha` column shows Cronbach's alpha computed after dropping the item.

```{r}
psych::alpha(cesd)
```

* What happens to Cronbach's alpha when leaving out an item that has a low item correlation?

* How would you use the raw and corrected item correlations to help you decide what items to include in your final scale?