Modern applied sciences (including but not limited to economics, medicine, and engineering) rely heavily on asymptotic theory. It allows applied scientists to make quantitative claims about large populations based on limited samples. Most importantly, the superpower of asymptotic theory is providing a rigorous (asymptotic) mathematical framework to evaluate and quantify the precision of those claims about the true, unknown population. Without it, we would mostly be guessing, with no reliable foundation for evaluating how good or bad those guesses are.
Here we learn the foundation of asymptotic statistical inference. Note by note.
This note explains some basic concepts and how to compute the variance of the sample mean step by step. We begin with the core foundation then move carefully through the math.
Suppose we take a sample of $n$ random variables from the population: $$ X_1, X_2, \dots, X_n. $$
We assume: (1) the $X_i$ are independent; (2) they are identically distributed; and (3) their common mean $\mathbb{E}[X_i] = \mu$ and variance $\mathrm{Var}(X_i) = \sigma^2$ are finite.
These three assumptions describe the population from which we obtain our sample of $n$ random variables. In practice, we rarely observe the entire population. Moreover, the population is often dynamic rather than fixed, changing, for example, over time. From a practical standpoint, thinking of the population as a data-generating process doesn't seem a crazy idea. From a theoretical standpoint, defining a population as a data-generating process is extremely useful. In fact, modern statistics adopts this notion of population to develop asymptotic theory.
It is helpful to understand the basic concepts of $\mathbb{E}$ and $\mathrm{Var}$.
$\mathbb{E}[x]$ is the expected value (or mean) of the random variable $x$. This is a population concept as it requires the whole population to derive. If $x$ is a discrete random variable with probability mass function $p(x)$, then $$ \mathbb{E}[x] = \sum_x x \, p(x). $$ If $x$ is a continuous random variable with probability density function $f(x)$, then $\mathbb{E}[x] = \int x \, f(x)\, dx$.
$\mathrm{Var}(x)$ is called the variance of the random variable $x$. It measures how spread out the values of $x$ are around their mean. The variance is defined as $$ \mathrm{Var}(x) = \mathbb{E}\left[(x - \mathbb{E}[x])^2\right]. $$ Therefore, it is also a population concept.
Since $\mathbb{E}[x]$ is a constant, a special case of a random variable where one specific value has the probability mass of one, $\mathbb{E}\left[\mathbb{E}[x]\right]=\mathbb{E}[x]$. We can simplify the variance further: $$ \begin{aligned} \mathrm{Var}(x) &= \mathbb{E}\left[x^2 - 2x\mathbb{E}[x] + \left(\mathbb{E}[x]\right)^2\right] \\ &= \mathbb{E}[x^2] - 2\left(\mathbb{E}[x]\right)^2 + \left(\mathbb{E}[x]\right)^2 \\ &= \mathbb{E}[x^2] - \left(\mathbb{E}[x]\right)^2. \end{aligned} $$ From here we can also see that $\mathrm{Var}(cx)=c^2 \mathrm{Var}(x)$ where $c$ is a constant.
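These two identities are easy to verify numerically. The sketch below (using numpy; the normal population and its parameters are arbitrary choices for illustration) estimates both sides from a large sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)  # population: N(2, 3^2)

# Var(x) two ways: the definition vs. the shortcut E[x^2] - (E[x])^2
var_direct = np.mean((x - x.mean()) ** 2)
var_shortcut = np.mean(x ** 2) - x.mean() ** 2
print(var_direct, var_shortcut)  # both close to 3^2 = 9

# Var(c x) = c^2 Var(x)
c = 5.0
var_scaled = np.mean((c * x - (c * x).mean()) ** 2)
print(var_scaled)  # close to 25 * 9 = 225
```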
The assumption we made about the population is equivalent to: each $X_i$ has the same mean $\mathbb{E}[X_i] = \mu$ and the same variance $\mathrm{Var}(X_i) = \sigma^2 < \infty$, and the $X_i$ are mutually independent.
For a given set of $n$ variables, the corresponding sample mean is defined as $$ \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i. $$
Since each element in the sample is a random variable, the sample mean is also a random variable. By the definition of variance, $$ \begin{aligned} \mathrm{Var}(\bar X_n) &= \mathbb{E}\left[\bar X_n^2\right] - \left(\mathbb{E}[\bar X_n]\right)^2 \\ &= \mathbb{E}\left[\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2\right] - \mu^2 \\ &= \frac{1}{n^2} \mathbb{E}\left[\left(\sum_{i=1}^n X_i\right)^2\right] - \mu^2 \\ &= \frac{1}{n^2} \mathbb{E}\!\left[\sum_{i=1}^n X_i^2 + 2\sum_{i<j} X_i X_j \right] - \mu^2 \end{aligned} $$
Intuitively, independence means that knowing the value of $X_i$ provides no information about the value of $X_j$. As a result, their joint behavior separates into individual components.
Formally, $X_i$ and $X_j$ are independent if their joint distribution factorizes into the product of the marginals: $P(X_i = x, X_j = y) = P(X_i = x)\,P(X_j = y)$ for all $x, y$ in the discrete case, or $f_{X_i,X_j}(x,y) = f_{X_i}(x)\,f_{X_j}(y)$ in the continuous case.
In the somewhat lengthy and repetitive notes that follow, we will show that when $X_i$ and $X_j$ are independent, the mean of their product is the product of their means: $$ \mathbb{E}[X_i X_j] = \mathbb{E}[X_i]\mathbb{E}[X_j]. $$
Discrete case. $$ \begin{aligned} \mathbb{E}[X_i X_j] &= \sum_x \sum_y x y \, P(X_i = x, X_j = y) \\ &= \sum_x \sum_y x y \, P(X_i = x)\,P(X_j = y) \\ &= \left( \sum_x x \, P(X_i = x) \right) \left( \sum_y y \, P(X_j = y) \right) \\ &= \mathbb{E}[X_i]\mathbb{E}[X_j] \end{aligned} $$
Continuous case. $$ \begin{aligned} \mathbb{E}[X_i X_j] &= \int\!\!\int x y \, f_{X_i,X_j}(x,y)\,dx\,dy \\ &= \int\!\!\int x y \, f_{X_i}(x)\,f_{X_j}(y)\,dx\,dy \\ &= \left( \int x \, f_{X_i}(x)\,dx \right) \left( \int y \, f_{X_j}(y)\,dy \right) \\ &= \mathbb{E}[X_i]\mathbb{E}[X_j]. \end{aligned} $$
Mixed case. $$ \begin{aligned} \mathbb{E}[X_i X_j] &= \sum_x \int x y \, f_{X_i,X_j}(x,y)\,dy \\ &= \sum_x \int x y \, P(X_i = x)\,f_{X_j}(y)\,dy \\ &= \left( \sum_x x \, P(X_i = x) \right) \left( \int y \, f_{X_j}(y)\,dy \right) \\ &= \mathbb{E}[X_i]\mathbb{E}[X_j]. \end{aligned} $$
Now we can complete the derivation of the sample mean variance: $$ \begin{aligned} \mathrm{Var}(\bar X_n) &= \frac{1}{n^2} \mathbb{E}\!\left[\sum_{i=1}^n X_i^2 + 2\sum_{i<j} X_i X_j \right] - \mu^2 \\ &= \frac{1}{n^2} \left( \sum_{i=1}^n \mathbb{E}[X_i^2] + 2\sum_{i<j} \mathbb{E}[X_i]\mathbb{E}[X_j] \right) - \mu^2 \\ &= \frac{1}{n^2} \left( n(\sigma^2 + \mu^2) + n(n-1)\mu^2 \right) - \mu^2 \\ &= \frac{\sigma^2}{n}. \end{aligned} $$
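To see $\mathrm{Var}(\bar X_n) = \sigma^2/n$ in action, we can simulate many samples of size $n$ and look at the spread of their sample means (a numpy sketch; all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 50, 20_000
sigma2 = 4.0  # population variance (std = 2), population mean 0

# each row is one sample of size n; take the mean of each row
samples = rng.normal(loc=0.0, scale=2.0, size=(trials, n))
xbar = samples.mean(axis=1)

print(xbar.var())                  # close to sigma2 / n = 0.08
print((np.sqrt(n) * xbar).var())   # close to sigma2 = 4
```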
Or equivalently, $$\mathrm{Var}(\sqrt{n} \bar X_n)= \sigma^2.$$
We have shown that the variance of the sample mean shrinks as $n$ increases. However, if we "magnify" the sample mean by a factor of $\sqrt{n}$, its variance becomes exactly the variance of $X_i$.
In probability theory, there are several notions of convergence for sequences of random variables. The hierarchy of convergence is: Sure → Almost Surely → In Probability → In Distribution.
A sequence of random variables $X_1, X_2, X_3, \dots$ converges in distribution to a random variable $X$ if: $$\lim_{n\to\infty} F_n(x) = F(x)$$ for every $x \in \mathbb{R}$ at which $F$ is continuous.
In everyday words, when $n$ goes to infinity, the random variable $X_n$ virtually has the same distribution as the limit random variable $X$. Convergence in distribution is also called weak convergence.
A sequence of random variables $X_1, X_2, X_3, \dots$ converges in probability to a random variable $X$ (or a constant $c$) if, for every positive number $\epsilon > 0$: $$\lim_{n\to\infty} P(|X_n - X| \ge \epsilon) = 0$$
In everyday words, when $n$ goes to infinity, the random variable $X_n$ becomes virtually the limit random variable $X$.
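Convergence in probability can also be seen numerically: for a fixed $\epsilon$, the probability of a deviation of at least $\epsilon$ shrinks toward zero as $n$ grows. A sketch (Bernoulli population, numpy; the choices of $\epsilon$ and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, eps, trials = 0.5, 0.05, 10_000

# estimate P(|Xbar_n - mu| >= eps) for a Bernoulli(0.5) population
for n in (10, 100, 1_000, 10_000):
    xbar = rng.binomial(n, mu, size=trials) / n  # sample means of size-n samples
    print(n, np.mean(np.abs(xbar - mu) >= eps))  # shrinks toward 0
```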
Suppose $X_1, X_2, X_3, \dots, X_n$ is a sequence of i.i.d. random variables with $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$. Then, as $n$ approaches infinity, the random variables $\sqrt{n}(\bar{X}_n - \mu)$ converge in distribution to a normal $\mathcal{N}(0, \sigma^2)$: $$\sqrt{n} \left( \bar{X}_n - \mu \right) \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$
Note: For each $n$, the computer runs 2,000 independent trials. In each trial, it draws $n$ random variables from the selected population and calculates their sample mean, $\bar{X}_n$. The bars above show the distribution of those 2,000 sample means.
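The same experiment is easy to reproduce in code. The sketch below mimics the setup described above (2,000 trials; here the population is an Exponential(1), a deliberately skewed choice) and checks that $\sqrt{n}(\bar X_n - \mu)$ behaves like $\mathcal N(0, \sigma^2)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 2_000

# Exponential(1) population: mu = 1, sigma^2 = 1, heavily skewed
draws = rng.exponential(1.0, size=(trials, n))
z = np.sqrt(n) * (draws.mean(axis=1) - 1.0)

print(z.std())                    # close to sigma = 1
print(np.mean(np.abs(z) <= 1.0))  # close to 0.68, as for a standard normal
```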
A note on terminologies: Univariate analysis concerns a single variable, bivariate analysis concerns two, and multivariate analysis concerns three or more variables. The simplest form of regression is bivariate, focusing on the relationship between a dependent variable $y$ and an independent variable $x$.
It is helpful to rewrite the CLT so that the left-hand side is a random variable with mean zero:
Let $\{Z_i\}_{i=1}^n$ be i.i.d. random variables such that
$$ \mathbb E[Z_i] = 0, \qquad \mathbb E[Z_i^2] = \tau^2 < \infty. $$
The classical central limit theorem (CLT) states:
$$ \frac{1}{\sqrt n}\sum_{i=1}^n Z_i \;\xrightarrow{d}\; \mathcal N(0,\tau^2). $$
Consider the model $$ y_i = \beta_0 + \beta_1 x_i + u_i, \qquad i = 1,\dots,n, $$ with the following assumptions: the pairs $(x_i, u_i)$ are i.i.d. across $i$; $\mathbb E[u_i \mid x_i] = 0$; $\operatorname{Var}(u_i \mid x_i) = \sigma^2$ (homoskedasticity); and $0 < \operatorname{Var}(x_i) < \infty$.
Now we will carefully go through the derivation of the asymptotic normality of the OLS estimator.
The OLS estimator of $\beta_1$ is $$ \hat\beta_1 = \frac{\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)} {\sum_{i=1}^n (x_i-\bar x)^2}. $$
Substituting $y_i = \beta_0 + \beta_1 x_i + u_i$, $$ \hat\beta_1 - \beta_1 = \frac{\sum_{i=1}^n (x_i-\bar x)u_i} {\sum_{i=1}^n (x_i-\bar x)^2}. $$
Equivalently, $$ \sqrt n(\hat\beta_1 - \beta_1) = \frac{\frac{1}{\sqrt n}\sum_{i=1}^n (x_i-\bar x)u_i} {\frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2}. $$
As $n$ goes to infinity, $\bar{x}$ converges in probability to $\mu_x$ by the Law of Large Numbers.
Let's define the new random variable $Z_i$ to apply the CLT: $$ Z_i = (x_i-\mu_x)u_i. $$ The first moment is zero, which satisfies the CLT requirement ($\mathbb E[Z_i] = 0$). Let's find the second moment. $$ \begin{aligned} \mathbb E[Z_i^2] &= \mathbb E[(x_i - \mu_x)^2 u_i^2] && \\ &= \mathbb E \left[ \mathbb E[(x_i - \mu_x)^2 u_i^2 \mid x_i] \right] && \text{Law of Iterated Expectations} \\ &= \mathbb E \left[ (x_i - \mu_x)^2 \mathbb E[u_i^2 \mid x_i] \right] && \text{Conditioning on } x_i \\ &= \mathbb E[(x_i - \mu_x)^2 \sigma^2] && \text{Homoskedasticity Assumption} \\ &= \sigma^2 \mathbb E[(x_i - \mu_x)^2] && \\ &= \sigma^2 \operatorname{Var}(x_i) && \end{aligned} $$
Combining them via Slutsky's Theorem: $$ \sqrt n(\hat\beta_1 - \beta_1) \xrightarrow{d} \mathcal N\left(0, \frac{\sigma^2}{\mathrm{Var}(x_i)}\right) $$ If we drop the homoskedasticity assumption, the result becomes $$ \sqrt n(\hat\beta_1 - \beta_1) \xrightarrow{d} \mathcal N\left(0, \frac{\mathbb E[(x_i - \mu_x)^2 u_i^2]}{(\operatorname{Var}(x_i))^2}\right). $$
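We can check the slope result by simulation (a numpy sketch; the coefficients, the normal design, and the homoskedastic errors are all illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 1_000, 2_000
beta0, beta1, sigma = 1.0, 2.0, 0.5

slopes = np.empty(trials)
for t in range(trials):
    x = rng.normal(3.0, 1.5, n)    # Var(x_i) = 2.25
    u = rng.normal(0.0, sigma, n)  # homoskedastic errors
    y = beta0 + beta1 * x + u
    slopes[t] = np.cov(x, y, bias=True)[0, 1] / x.var()  # OLS slope

z = np.sqrt(n) * (slopes - beta1)
print(z.std())  # close to sqrt(sigma^2 / Var(x_i)) = 0.5 / 1.5 ≈ 0.333
```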
The OLS slope estimator is asymptotically normal because it is a scaled sample mean of the random variables $(x_i - \mathbb E[x_i])u_i$. The classical CLT, combined with the law of large numbers and Slutsky’s theorem, fully explains this result.
The derivation for the intercept is slightly more complex than the slope because the estimation of $\hat\beta_0$ depends directly on the error in estimating $\hat\beta_1$. We begin with the OLS formula for the intercept: $$ \hat\beta_0 = \bar y - \hat\beta_1 \bar x $$
Since $y_i = \beta_0 + \beta_1 x_i + u_i$, then:
$$ \begin{aligned} \hat\beta_0 - \beta_0 &= (\beta_0 + \beta_1 \bar x + \bar u) - \hat\beta_1 \bar x - \beta_0 \\ &= \bar u - (\hat\beta_1 - \beta_1)\bar x. \end{aligned} $$
From the previous note we have: $$ \sqrt n(\hat\beta_1 - \beta_1) = \frac{\frac{1}{\sqrt n}\sum_{i=1}^n (x_i-\bar x)u_i} {\frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2}. $$ Replacing $\bar x$ with $\mu_x$ and the denominator with $\operatorname{Var}(x_i)$ (both justified asymptotically by the LLN), $$ \sqrt n(\hat\beta_0 - \beta_0) = \frac{1}{\sqrt n} \sum_{i=1}^n \left[ 1 - \frac{\mu_x(x_i - \mu_x)}{\sigma_x^2} \right] u_i + o_p(1), $$ where $o_p(1)$ denotes a remainder that converges in probability to zero. Here we write $\sigma_x^2$ for $\operatorname{Var}(x_i)$; there is no subscript $i$ because the variance is the same for every $i$.
Let $W_i = [1 - g(x_i)]u_i$ where $g(x_i) = \frac{\mu_x(x_i - \mu_x)}{\sigma_x^2}$. Again, the first moment is zero, which satisfies the CLT requirement ($\mathbb E[W_i] = 0$). Let's find the second moment.
$$ \begin{aligned} \mathbb E[W_i^2] &= \sigma^2 \mathbb E \left[ \left( 1 - \frac{\mu_x(x_i - \mu_x)}{\sigma_x^2} \right)^2 \right] && \text{Homoskedasticity} \\ &= \sigma^2 \mathbb E \left[ 1 - \frac{2\mu_x(x_i - \mu_x)}{\sigma_x^2} + \frac{\mu_x^2(x_i - \mu_x)^2}{(\sigma_x^2)^2} \right] && \text{} \\ &= \sigma^2 \left( 1 - 0 + \frac{\mu_x^2 \sigma_x^2}{(\sigma_x^2)^2} \right) && \text{Since } \mathbb E[x_i - \mu_x] = 0 \\ &= \sigma^2 \left( 1 + \frac{\mu_x^2}{\sigma_x^2} \right) && \text{} \end{aligned} $$
Recalling from Note 1 that $\sigma_x^2 = \mathbb E[x_i^2] - \mu_x^2$, we further simplify the term in the parentheses to $\frac{\mathbb E[x_i^2]}{\sigma_x^2}$.
Applying the CLT and Slutsky's Theorem: $$ \sqrt n(\hat\beta_0 - \beta_0) \xrightarrow{d} \mathcal N \left( 0, \frac{\sigma^2 \mathbb E[x_i^2]}{\operatorname{Var}(x_i)} \right). $$ If we drop the homoskedasticity assumption, the result becomes $$ \sqrt n(\hat\beta_0 - \beta_0) \xrightarrow{d} \mathcal N \left( 0, \mathbb E \left[ \left( 1 - \frac{\mu_x(x_i - \mu_x)}{\operatorname{Var}(x_i)} \right)^2 u_i^2 \right] \right). $$
Intuition: In the homoskedastic case, the intercept's asymptotic variance $\sigma^2\!\left(1 + \mu_x^2/\sigma_x^2\right)$ always exceeds $\sigma^2$, the variance contributed by the average level ($\bar u$) alone, unless $\mu_x = 0$. This is because the intercept's uncertainty combines the uncertainty in the average level with the uncertainty in the slope.
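The intercept result can be checked the same way (a numpy sketch with illustrative parameters; with these choices, $\sigma^2\,\mathbb E[x_i^2]/\operatorname{Var}(x_i) = 0.25 \times 11.25 / 2.25 = 1.25$):

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 1_000, 2_000
beta0, beta1, sigma = 1.0, 2.0, 0.5
mu_x, sd_x = 3.0, 1.5  # so Var(x_i) = 2.25 and E[x_i^2] = 11.25

inter = np.empty(trials)
for t in range(trials):
    x = rng.normal(mu_x, sd_x, n)
    u = rng.normal(0.0, sigma, n)  # homoskedastic errors
    y = beta0 + beta1 * x + u
    b1 = np.cov(x, y, bias=True)[0, 1] / x.var()
    inter[t] = y.mean() - b1 * x.mean()  # OLS intercept

z = np.sqrt(n) * (inter - beta0)
print(z.var())  # close to sigma^2 * E[x_i^2] / Var(x_i) = 1.25
```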
Notice that in the asymptotic distributions in Notes 4 and 5, the variances are expressed in terms of population parameters: $\sigma^2$ (the variance of the error term), $\operatorname{Var}(x_i)$, and $\mathbb{E}[\cdot]$. By invoking Slutsky’s Theorem and the Law of Large Numbers (LLN) once again, we can replace these population values with consistent sample estimators. An estimator is consistent if, as the sample size $n$ increases, the estimate converges in probability to the true population value. As long as the estimators used are consistent, the asymptotic distribution remains unchanged.
The population error $u_i$ is unobservable. However, once we have calculated the OLS estimates $\hat\beta_0$ and $\hat\beta_1$, we can calculate the OLS residuals, denoted as $\hat u_i$: $$ \hat u_i = y_i - \hat\beta_0 - \hat\beta_1 x_i $$
While $\hat u_i \neq u_i$ in finite samples, $\hat u_i$ is a consistent estimator of $u_i$ because $\hat\beta_0 \xrightarrow{p} \beta_0$ and $\hat\beta_1 \xrightarrow{p} \beta_1$. We use these residuals to construct our sample analogs, which, by construction, are consistent.
| Population Term | Sample Analog (Using $\hat u_i$) |
|---|---|
| $\operatorname{Var}(x_i)$ | $\hat \sigma_x^2 = \frac{1}{n} \sum (x_i - \bar x)^2$ |
| $\sigma^2$ (Homoskedastic) | $\hat \sigma^2 = \frac{1}{n} \sum \hat u_i^2$ |
| $\mathbb E[(x_i - \mu_x)^2 u_i^2]$ | $\frac{1}{n} \sum (x_i - \bar x)^2 \hat u_i^2$ |
| $\mathbb E \left[ \left( 1 - \frac{\mu_x(x_i - \mu_x)}{\operatorname{Var}(x_i)} \right)^2 u_i^2 \right]$ | $\frac{1}{n} \sum \left( 1 - \frac{\bar x (x_i - \bar x)}{\hat \sigma_x^2} \right)^2 \hat u_i^2$ |
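The table translates into a few lines of code. The sketch below (numpy; the heteroskedastic error design is an arbitrary illustration) computes the homoskedastic and robust variance estimators for the slope from simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5_000
x = rng.normal(3.0, 1.5, n)
u = rng.normal(0.0, 0.5, n) * (1 + 0.3 * np.abs(x - 3))  # heteroskedastic errors
y = 1.0 + 2.0 * x + u

# OLS estimates and residuals
b1 = np.cov(x, y, bias=True)[0, 1] / x.var()
b0 = y.mean() - b1 * x.mean()
uhat = y - b0 - b1 * x

# sample analogs from the table above
var_x_hat = np.mean((x - x.mean()) ** 2)
sigma2_hat = np.mean(uhat ** 2)                      # homoskedastic sigma^2
meat_hat = np.mean((x - x.mean()) ** 2 * uhat ** 2)  # robust "meat"

avar_homo = sigma2_hat / var_x_hat        # valid only under homoskedasticity
avar_robust = meat_hat / var_x_hat ** 2   # valid either way
print(avar_homo, avar_robust)  # robust exceeds homoskedastic in this design
```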
| Theoretical Population Form | Sample Analog Form |
|---|---|
| Case 1: Under Homoskedasticity | |
| $$ \displaystyle \sqrt n(\hat\beta_1 - \beta_1) \xrightarrow{d} \mathcal N \left( 0, \frac{\sigma^2}{\operatorname{Var}(x_i)} \right) $$ | $$ \displaystyle \sqrt n(\hat\beta_1 - \beta_1) \xrightarrow{d} \mathcal N \left( 0, \frac{\hat \sigma^2}{\hat \sigma_x^2} \right) $$ |
| $$ \displaystyle \sqrt n(\hat\beta_0 - \beta_0) \xrightarrow{d} \mathcal N \left( 0, \frac{\sigma^2 \mathbb E[x_i^2]}{\operatorname{Var}(x_i)} \right) $$ | $$ \displaystyle \sqrt n(\hat\beta_0 - \beta_0) \xrightarrow{d} \mathcal N \left( 0, \frac{\hat \sigma^2 (\frac{1}{n}\sum x_i^2)}{\hat \sigma_x^2} \right) $$ |
| Case 2: Under Heteroskedasticity (Robust) | |
| $$ \displaystyle \sqrt n(\hat\beta_1 - \beta_1) \xrightarrow{d} \mathcal N \left( 0, \frac{\mathbb E[(x_i - \mu_x)^2 u_i^2]}{(\operatorname{Var}(x_i))^2} \right) $$ | $$ \displaystyle \sqrt n(\hat\beta_1 - \beta_1) \xrightarrow{d} \mathcal N \left( 0, \frac{\frac{1}{n} \sum (x_i - \bar x)^2 \hat u_i^2}{(\hat \sigma_x^2)^2} \right) $$ |
| $$ \displaystyle \sqrt n(\hat\beta_0 - \beta_0) \xrightarrow{d} \mathcal N \left( 0, \mathbb E[W_i^2] \right) $$ | $$ \displaystyle \sqrt n(\hat\beta_0 - \beta_0) \xrightarrow{d} \mathcal N \left( 0, \frac{1}{n} \sum \left[ 1 - \frac{\bar x (x_i - \bar x)}{\hat \sigma_x^2} \right]^2 \hat u_i^2 \right) $$ |
Note that while the $1/n$ scaling is appropriate for asymptotic consistency, practical applications with smaller samples often require degrees-of-freedom corrections (such as $n-2$ for simple linear regression). These adjustments ensure our standard error estimates remain reliable when the sample size is limited. We will cover this topic in detail in the later notes.
In Note 1 we talked about a single random variable $X_i$ and characterised it with two numbers: its mean $\mu$ and its variance $\sigma^2$. As long as we work with one variable at a time, variance is very helpful for describing the spread. But in multivariate regressions, we have more than one independent variable, say $x^{(1)}, x^{(2)}, \dots$, and it is natural to ask: do they move together? If $x^{(1)}$ is large, is $x^{(2)}$ also large, or small, or unrelated? This is where we will need to introduce two new notions called covariance and correlation.
Imagine we observe two variables for each individual $i$: their years of education $X_i$ and their income $Y_i$. Naturally, for a random individual, $X_i$ and $Y_i$ can be considered as two random variables. However, it is also natural to expect that these two random variables may have some connection. People with more education tend to have higher income. So when $X_i$ is above its mean, $Y_i$ also tends to be above its mean. We want a single number that captures this tendency.
Recall from Note 1 that variance measures spread by looking at how far a variable departs from its own mean: $$ \mathrm{Var}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right]. $$ The key object here is the deviation from the mean, $X - \mathbb{E}[X]$. To capture whether two variables move together, we can look at whether their deviations from their respective means tend to have the same sign at the same time. That is the idea behind covariance.
Let $X$ and $Y$ be two random variables with means $\mathbb{E}[X] = \mu_X$ and $\mathbb{E}[Y] = \mu_Y$. The covariance of $X$ and $Y$ is defined as $$ \mathrm{Cov}(X, Y) = \mathbb{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right]. $$
When the two variables are identical, $Y = X$, this reduces to $$ \mathrm{Cov}(X, X) = \mathbb{E}\!\left[(X - \mu_X)^2\right] = \mathrm{Var}(X). $$ So variance is a special case of covariance.
The product $(X - \mu_X)(Y - \mu_Y)$ can be positive or negative: it is positive when $X$ and $Y$ deviate from their means in the same direction (both above or both below), and negative when they deviate in opposite directions.
Taking the expectation over all possible realisations: if same-sign deviations dominate, the covariance is positive; if opposite-sign deviations dominate, it is negative; and if neither dominates, it is close to zero.
Suppose $X$ and $Y$ take the following values with equal probability $\frac{1}{4}$ each. We can compute the covariance directly from the definition.
| Outcome | $X$ | $Y$ | $X - \mu_X$ | $Y - \mu_Y$ | $(X-\mu_X)(Y-\mu_Y)$ |
|---|---|---|---|---|---|
| 1 | 1 | 2 | $-2$ | $-2$ | $4$ |
| 2 | 2 | 3 | $-1$ | $-1$ | $1$ |
| 3 | 4 | 5 | $+1$ | $+1$ | $1$ |
| 4 | 5 | 6 | $+2$ | $+2$ | $4$ |
| Mean | $\mu_X = 3$ | $\mu_Y = 4$ | — | — | $\mathrm{Cov} = \frac{4+1+1+4}{4} = 2.5$ |
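The same calculation in code (numpy), directly from the definition:

```python
import numpy as np

# the four equally likely outcomes from the table
X = np.array([1.0, 2.0, 4.0, 5.0])
Y = np.array([2.0, 3.0, 5.0, 6.0])

# Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)], estimated with equal weights 1/4
cov = np.mean((X - X.mean()) * (Y - Y.mean()))
print(cov)  # 2.5, matching the table
```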
Just as we simplified $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ in Note 1, we can expand the definition of covariance: $$ \begin{aligned} \mathrm{Cov}(X, Y) &= \mathbb{E}\!\left[(X - \mu_X)(Y - \mu_Y)\right] \\ &= \mathbb{E}\!\left[XY - X\mu_Y - \mu_X Y + \mu_X \mu_Y\right] \\ &= \mathbb{E}[XY] - \mu_Y \mathbb{E}[X] - \mu_X \mathbb{E}[Y] + \mu_X \mu_Y \\ &= \mathbb{E}[XY] - \mu_X \mu_Y - \mu_X \mu_Y + \mu_X \mu_Y \\ &= \mathbb{E}[XY] - \mu_X \mu_Y. \end{aligned} $$ So equivalently: $$ \mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mu_X \mu_Y. $$
This formula connects naturally to independence. From Note 1, we know that when $X$ and $Y$ are independent, $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$. Substituting directly: $$ \mathrm{Cov}(X, Y) = \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[X]\mathbb{E}[Y] = 0. $$ Note: Independence implies zero covariance, but not the converse. Zero covariance does not guarantee independence (see Note 1 for the definition of independence).
The following properties follow directly from the definition and the linearity of expectation. Let $X$, $Y$, $W$ be random variables and $a$, $b$, $c$, $d$ be constants.
Symmetry. $$ \mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X). $$ The order does not matter because $(X-\mu_X)(Y-\mu_Y) = (Y-\mu_Y)(X-\mu_X)$.
Covariance with a constant is zero. $$ \mathrm{Cov}(X, c) = 0. $$ A constant never deviates from its mean, so the product of deviations is always zero.
Linearity in each argument. $$ \mathrm{Cov}(aX + bY,\; cW) = ac\,\mathrm{Cov}(X, W) + bc\,\mathrm{Cov}(Y, W). $$
Variance of a sum. Using linearity: $$ \begin{aligned} \mathrm{Var}(X + Y) &= \mathrm{Cov}(X + Y,\; X + Y) \\ &= \mathrm{Cov}(X, X) + \mathrm{Cov}(X, Y) + \mathrm{Cov}(Y, X) + \mathrm{Cov}(Y, Y) \\ &= \mathrm{Var}(X) + 2\,\mathrm{Cov}(X, Y) + \mathrm{Var}(Y). \end{aligned} $$ When $X$ and $Y$ are independent (so $\mathrm{Cov}(X,Y)=0$), this reduces to the familiar $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ used in Note 1. When they are positively correlated, the variance of the sum is larger than the sum of the individual variances. On the other hand, when they are negatively correlated, the variance of the sum is less than the sum of the individual variances.
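A quick numerical check of the variance-of-a-sum identity (numpy sketch; the construction of a correlated $Y$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(7)
z1 = rng.normal(size=500_000)
z2 = rng.normal(size=500_000)
X = z1
Y = 0.6 * z1 + 0.8 * z2  # Var(Y) = 1, Cov(X, Y) = 0.6

lhs = (X + Y).var()
cov_xy = np.mean((X - X.mean()) * (Y - Y.mean()))
rhs = X.var() + 2 * cov_xy + Y.var()
print(lhs, rhs)  # equal; both close to 1 + 2(0.6) + 1 = 3.2
```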
While the sign of covariance carries clear information, its magnitude says little, as it depends heavily on the units of the variables. For example, if we measure income in dollars versus thousands of dollars, the covariance changes by a factor of 1,000. This makes it hard to judge whether a covariance of, say, 500 is large or small.
We can remove the units by dividing by the standard deviations of both variables. This gives the definition of correlation: $$ \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}. $$
The correlation is always between $-1$ and $1$. A value of $+1$ means perfect positive linear relationship; $-1$ means perfect negative linear relationship; $0$ means no linear relationship. Unlike covariance, correlation is unit-free and directly comparable across different pairs of variables. Keep in mind that these are all about linear relationship.
Drag the slider to change the correlation ($\rho$). Use the $\omega$ box to scale $Y$ and observe that the slope of the line changes but the correlation does not. The dashed red line shows the direction of the linear trend.
The model. We draw $X \sim \mathcal{N}(0,1)$ and $Z \sim \mathcal{N}(0,1)$ independently, then construct $$ Y = \omega\!\left(\rho X + \sqrt{1-\rho^2}\,Z\right), $$ where $\omega > 0$ is the scale you set in the box above. We will show that the correlation between $X$ and $Y$ is actually $\rho$.
Since $X \sim \mathcal{N}(0,1)$, we have $\mathrm{Var}(X) = 1$.
Since $X$ and $Z$ are independent, we have $$ \mathrm{Var}(Y) = \omega^2\!\left[\rho^2\,\mathrm{Var}(X) + (1-\rho^2)\,\mathrm{Var}(Z)\right] = \omega^2\!\left[\rho^2 + (1-\rho^2)\right] = \omega^2. $$
Now we use the linearity property of covariance (mentioned above) and $\mathrm{Cov}(X,Z)=0$: $$ \begin{aligned} \mathrm{Cov}(X, Y) &= \mathrm{Cov}\!\left(X,\;\omega\rho X + \omega\sqrt{1-\rho^2}\,Z\right) \\ &= \omega\rho\,\mathrm{Cov}(X,X) + \omega\sqrt{1-\rho^2}\,\mathrm{Cov}(X,Z) \\ &= \omega\rho \cdot 1 + 0 \\ &= \omega\rho. \end{aligned} $$
Putting it all together, $$ \mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}} = \frac{\omega\rho}{\sqrt{1}\cdot\sqrt{\omega^2}} = \frac{\omega\rho}{\omega} = \rho. $$
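A simulation of this model confirms that the scale $\omega$ drops out of the correlation (numpy sketch; the values of $\rho$, $\omega$, and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n, rho, omega = 200_000, 0.7, 5.0

X = rng.normal(size=n)
Z = rng.normal(size=n)
Y = omega * (rho * X + np.sqrt(1 - rho**2) * Z)

corr = np.corrcoef(X, Y)[0, 1]
print(corr)  # close to rho = 0.7, regardless of omega
```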
In Note 4 we derived the asymptotic distribution of the OLS slope estimator when there is only one independent variable $x$. In practice, regressions almost always include several independent variables. Handling them one equation at a time would be very bulky, so we use matrix notation to write everything compactly.
This note has two goals: (1) introduce matrix notation from scratch, and (2) re-derive OLS and its asymptotic normality in the multivariate setting.
A column vector is simply a list of numbers stacked vertically. For example, if we observe the outcome $y$ for three individuals: $$ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}. $$ The bold lower-case letter $\mathbf{y}$ signals that this is a vector, not a single number. We say $\mathbf{y}$ has dimension $3 \times 1$ (three rows, one column).
A matrix is a table of numbers (in two dimensions). Suppose we have three observations and two independent variables $x^{(1)}$ and $x^{(2)}$. We can put all the data into a single table, adding a column of ones on the left to account for the intercept, as follows:
| Intercept | $x^{(1)}$ | $x^{(2)}$ | |
|---|---|---|---|
| Obs 1 | 1 | $x_1^{(1)}$ | $x_1^{(2)}$ |
| Obs 2 | 1 | $x_2^{(1)}$ | $x_2^{(2)}$ |
| Obs 3 | 1 | $x_3^{(1)}$ | $x_3^{(2)}$ |
We call this the design matrix: $$ \mathbf{X} = \begin{pmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ 1 & x_3^{(1)} & x_3^{(2)} \end{pmatrix}. $$ In general, with $n$ observations and $k$ columns (intercept plus $k-1$ variables), $\mathbf{X}$ is an $n \times k$ matrix.
The transpose of a matrix, written $\mathbf{X}^\top$, is obtained by turning every row into a column (and vice versa): $$ \mathbf{X} = \begin{pmatrix} 1 & x_1^{(1)} & x_1^{(2)} \\ 1 & x_2^{(1)} & x_2^{(2)} \\ 1 & x_3^{(1)} & x_3^{(2)} \end{pmatrix} \quad\Longrightarrow\quad \mathbf{X}^\top = \begin{pmatrix} 1 & 1 & 1 \\ x_1^{(1)} & x_2^{(1)} & x_3^{(1)} \\ x_1^{(2)} & x_2^{(2)} & x_3^{(2)} \end{pmatrix}. $$ The transpose of an $n \times k$ matrix is a $k \times n$ matrix. For a column vector $\mathbf{y}$ of dimension $n\times 1$, its transpose $\mathbf{y}^\top$ is a row vector of dimension $1\times n$.
The visualization below is to help you get a better idea of how transposing works. The $3 \times 4$ matrix below flips along its main diagonal (the line from the top-left toward the bottom-right corner). The direction of the flip (clockwise or counter-clockwise) does not matter. Rows become columns and vice versa, and the dimensions change from $3 \times 4$ to $4 \times 3$. Applying the transpose twice returns the matrix to its original form.
To multiply matrix $\mathbf{A}$ ($m\times p$) by matrix $\mathbf{B}$ ($p\times n$), the inner dimensions must match. Here, both equal $p$. The result $\mathbf{C} = \mathbf{A}\mathbf{B}$ is $m\times n$, and each entry is: $$ C_{ij} = \sum_{l=1}^{p} A_{il}\, B_{lj}. $$ That is, the $(i,j)$ entry of $\mathbf{C}$ is the dot product of the $i$-th row of $\mathbf{A}$ with the $j$-th column of $\mathbf{B}$.
The two products we will meet repeatedly are $\mathbf{X}^\top\mathbf{X}$ (a $k \times k$ matrix) and $\mathbf{X}^\top\mathbf{y}$ (a $k \times 1$ vector).
Note: these products should feel familiar. In the bivariate case, the OLS formula is built from exactly such sums.
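Here is a tiny example of both products (numpy; a made-up design matrix with an intercept column and one regressor), showing the familiar sums appearing entry by entry:

```python
import numpy as np

# design matrix: intercept column plus one regressor x = (2, 3, 5)
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([1.0, 2.0, 4.0])

XtX = X.T @ X  # [[n, sum x], [sum x, sum x^2]] = [[3, 10], [10, 38]]
Xty = X.T @ y  # [sum y, sum x*y] = [7, 28]
print(XtX)
print(Xty)
```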
The example below uses a design matrix with 4 observations and 3 columns (an intercept column of ones, plus two regressors). You can choose between $\mathbf{X}^\top\mathbf{X}$ and $\mathbf{X}^\top\mathbf{y}$, then step through to watch each entry of the result computed as a dot product. The active row of $\mathbf{X}^\top$ is highlighted in yellow; the active column of $\mathbf{X}$ (or $\mathbf{y}$) in blue.
(To be continued.)