Calculus, Probability, and Statistics Primers, Continued




Notice a typo? Please submit an issue or open a PR.


Conditional Expectation (OPTIONAL)

In this lesson, we are going to continue our exploration of conditional expectation and look at several cool applications. This lesson will likely be the toughest one for the foreseeable future, but don't panic!

Conditional Expectation Recap

Let's revisit the conditional expectation of YY given X=xX = x. The definition of this expectation is as follows:

E[YX=x]={yyf(yx)discreteRyf(yx)dycontinuousE[Y|X = x] = \left\{\begin{matrix} \sum_y y f(y|x) & \text{discrete} \\ \int_{\mathbb{R}} y f(y|x)dy & \text{continuous} \end{matrix}\right.

For example, suppose f(x,y) = 21x^2y / 4 for x^2 \leq y \leq 1. Then, by definition:

f(yx)=f(x,y)fX(x)f(y|x) = \frac{f(x,y)}{f_X(x)}

We calculated the marginal pdf, fX(x)f_X(x), previously, as the integral of f(x,y)f(x,y) over all possible values of y[x2,1]y \in [x^2, 1]. We can plug in fX(x)f_X(x) and f(x,y)f(x, y) below:

f(yx)=214x2y218(1x4)=2y1x4,x2y1f(y|x) = \frac{\frac{21}{4}x^2y}{\frac{21}{8}(1- x^4)} = \frac{2y}{1 - x^4}, \quad x^2 \leq y \leq 1

Given f(yx)f(y|x), we can now compute E[YX=x]E[Y | X = x]:

E[YX=x]=Ry(2y1x4)dyE[Y | X = x] = \int_{\mathbb{R}} y * \left(\frac{2y}{1 - x^4}\right)dy

We adjust the limits of integration to match the limits of yy:

E[YX=x]=x21y(2y1x4)dyE[Y | X = x] = \int_{x^2}^1 y * \left(\frac{2y}{1 - x^4}\right)dy

Now, complete the integration:

E[YX=x]=x212y21x4dyE[Y | X = x] = \int_{x^2}^1 \frac{2y^2}{1 - x^4}dy
E[YX=x]=21x4x21y2dyE[Y | X = x] = \frac{2}{1 - x^4} \int_{x^2}^1 y^2dy
E[YX=x]=21x4y33x21E[Y | X = x] = \frac{2}{1 - x^4} \frac{y^3}{3}\Big|^1_{x^2}
E[YX=x]=23(1x4)y3x21=2(1x6)3(1x4)E[Y | X = x] = \frac{2}{3(1 - x^4)} y^3\Big|^1_{x^2} = \frac{2(1 - x^6)}{3(1 - x^4)}
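
Since this is a simulation course, a quick numerical sanity check is worthwhile. Below is a minimal sketch, assuming NumPy is available (the function names are ours): we approximate the integral of y f(y|x) over [x^2, 1] with a Riemann sum and compare against the closed form.

```python
import numpy as np

def cond_exp_numeric(x, n=1_000_000):
    """Approximate E[Y | X = x] = integral over [x^2, 1] of y * f(y|x) dy."""
    y = np.linspace(x**2, 1, n)
    f_cond = 2 * y / (1 - x**4)               # conditional pdf f(y|x) on [x^2, 1]
    return np.sum(y * f_cond) * (y[1] - y[0]) # simple Riemann sum

def cond_exp_closed(x):
    return 2 * (1 - x**6) / (3 * (1 - x**4))

print(cond_exp_numeric(0.5))   # ~0.7
print(cond_exp_closed(0.5))    # 0.7 exactly
```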

Double Expectations

We just looked at the expected value of YY given a particular value X=xX = x. Now we are going to average the expected value of YY over all values of XX. In other words, we are going to take the average expected value of all the conditional expected values, which will give us the overall population average for YY.

The theorem of double expectations states that the expected value of the expected value of YY given XX is the expected value of YY. In other words:

E[E(YX)]=E[Y]E[E(Y|X)] = E[Y]

Let's look at E[YX]E[Y|X]. We can use the formula that we used to calculate E[YX=x]E[Y|X=x] to find E[YX]E[Y|X], replacing xx with XX. Let's look back at our conditional expectation from the previous slide:

E[YX=x]=2(1x6)3(1x4)E[Y | X = x] = \frac{2(1 - x^6)}{3(1 - x^4)}

If we replace x with the random variable X, we get the following expression:

E[Y | X] = \frac{2(1 - X^6)}{3(1 - X^4)}

What does this mean? E[YX]E[Y|X] is itself a random variable that is a function of the random variable XX. Let's call this function hh:

h(X)=2(1X6)3(1X4)h(X) = \frac{2(1 - X^6)}{3(1 - X^4)}

We now have to calculate E[h(X)], which we can accomplish using LOTUS, the law of the unconscious statistician:

E[h(X)]=Rh(x)fX(x)dxE[h(X)] = \int_{\mathbb{R}} h(x)f_X(x)dx

Let's substitute in for h(x)h(x) and h(X)h(X):

E[E[YX]]=RE(Yx)fX(x)dxE[E[Y|X]] = \int_{\mathbb{R}} E(Y|x)f_X(x)dx

Remember the definition for E[YX=x]E[Y|X = x]:

E[YX=x]={yyf(yx)discreteRyf(yx)dycontinuousE[Y|X = x] = \left\{\begin{matrix} \sum_y y f(y|x) & \text{discrete} \\ \int_{\mathbb{R}} y f(y|x)dy & \text{continuous} \end{matrix}\right.

Thus:

E[E[YX]]=R(Ryf(yx)dy)fX(x)dxE[E[Y|X]] = \int_{\mathbb{R}} \left(\int_{\mathbb{R}} y f(y|x)dy\right)f_X(x)dx

We can rearrange the right-hand side. After swapping the order of integration, we can move y outside the inner integral, since y is a constant when we integrate with respect to x:

E[E[YX]]=RRyf(yx)fX(x)dxdy=RyRf(yx)fX(x)dxdyE[E[Y|X]] = \int_{\mathbb{R}} \int_{\mathbb{R}} y f(y|x)f_X(x)dx dy = \int_{\mathbb{R}} y \int_{\mathbb{R}} f(y|x)f_X(x)dx dy

Remember now the definition for the conditional pdf:

f(yx)=f(x,y)fX(x);f(yx)fX(x)=f(x,y)f(y|x) = \frac{f(x,y)}{f_X(x)}; \quad f(y|x)f_X(x) = f(x, y)

We can substitute in f(x,y)f(x,y) for f(yx)fX(x)f(y|x)f_X(x):

E[E[YX]]=RyRf(x,y)dxdyE[E[Y|X]] = \int_{\mathbb{R}} y \int_{\mathbb{R}} f(x,y)dx dy

Let's remember the definition for the marginal pdf of YY:

fY(y)=Rf(x,y)dxf_Y(y) = \int_{\mathbb{R}} f(x,y)dx

Let's substitute:

E[E[YX]]=RyfY(y)dyE[E[Y|X]] = \int_{\mathbb{R}} y f_Y(y)dy

Of course, the expected value of YY, E[Y]E[Y] equals:

E[Y]=RyfY(y)dyE[Y] = \int_{\mathbb{R}} y f_Y(y) dy

Thus:

E[E[YX]]=E[Y]E[E[Y|X]] = E[Y]

Example

Let's apply this theorem using our favorite joint pdf: f(x,y)=21x2y/4,x2y1f(x,y) = 21x^2y / 4, x^2 \leq y \leq 1. Through previous examples, we know fX(x)f_X(x), fY(y)f_Y(y) and E[Yx]E[Y|x]:

fX(x)=218x2(1x4)f_X(x) = \frac{21}{8}x^2(1-x^4)
fY(y)=72y52f_Y(y) = \frac{7}{2}y^{\frac{5}{2}}
E[Yx]=2(1x6)3(1x4)E[Y|x] = \frac{2(1 - x^6)}{3(1 - x^4)}

We are going to look at two ways to compute E[Y]. First, we can just use the definition of expected value and integrate the product y f_Y(y) over the real line:

E[Y]=RyfY(y)dyE[Y] = \int_{\mathbb{R}} y f_Y(y)dy
E[Y]=01y72y52dyE[Y] = \int_0^1 y * \frac{7}{2}y^{\frac{5}{2}} dy
E[Y]=0172y72dyE[Y] = \int_0^1 \frac{7}{2}y^{\frac{7}{2}} dy
E[Y]=7201y72dyE[Y] = \frac{7}{2} \int_0^1 y^{\frac{7}{2}} dy
E[Y]=7229y9201=79y9201=79E[Y] = \frac{7}{2} \frac{2}{9}y^\frac{9}{2}\Big|_0^1 = \frac{7}{9} y^\frac{9}{2}\Big|_0^1 = \frac{7}{9}

Now, let's calculate E[Y]E[Y] using the double expectation theorem we just learned:

E[Y]=E[E(YX)]=RE(Yx)fX(x)dxE[Y] = E[E(Y|X)] = \int_{\mathbb{R}} E(Y|x) f_X(x)dx
E[Y]=112(1x6)3(1x4)×218x2(1x4)dxE[Y] = \int_{-1}^1 \frac{2(1 - x^6)}{3(1 - x^4)} \times \frac{21}{8}x^2(1-x^4) dx
E[Y]=422411(1x6)(1x4)×x2(1x4)dxE[Y] = \frac{42}{24}\int_{-1}^1 \frac{(1 - x^6)}{(1 - x^4)} \times x^2(1-x^4) dx
E[Y]=422411x2(1x6)dxE[Y] = \frac{42}{24}\int_{-1}^1 x^2(1 - x^6) dx
E[Y]=422411x2x8dxE[Y] = \frac{42}{24}\int_{-1}^1 x^2 - x^8 dx
E[Y]=4224(x33x99)11E[Y] = \frac{42}{24} \left(\frac{x^3}{3} - \frac{x^9}{9}\right) \Big|_{-1}^1
E[Y]=4224(3x3x99)11E[Y] = \frac{42}{24} \left(\frac{3x^3 - x^9}{9} \right) \Big|_{-1}^1
E[Y]=4224(31(3+1)9)=422449=79E[Y] = \frac{42}{24} \left(\frac{3 - 1 - (-3+1)}{9} \right) = \frac{42}{24} * \frac{4}{9} = \frac{7}{9}
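
Here is a small numerical confirmation of both routes to E[Y], assuming NumPy (the grid sizes are arbitrary). We exclude the endpoints x = ±1 on the second route because E[Y | X = x] is a 0/0 form there, even though the full integrand tends to 0.

```python
import numpy as np

# Route 1: E[Y] = integral of y * f_Y(y) over [0, 1].
y = np.linspace(0, 1, 1_000_001)
dy = y[1] - y[0]
print(np.sum(y * 3.5 * y**2.5) * dy)       # ~0.7778 = 7/9

# Route 2: E[E[Y|X]] = integral of E[Y|x] * f_X(x) over [-1, 1].
x = np.linspace(-1, 1, 1_000_001)[1:-1]    # drop x = +/-1 (0/0 in h)
dx = 2 / 1_000_000
h = 2 * (1 - x**6) / (3 * (1 - x**4))      # E[Y | X = x]
f_x = 21 / 8 * x**2 * (1 - x**4)           # marginal pdf f_X(x)
print(np.sum(h * f_x) * dx)                # ~0.7778 = 7/9
```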

Mean of the Geometric Distribution

In this application, we are going to see how we can use double expectation to calculate the mean of a geometric distribution.

Let YY equal the number of coin flips before a head, HH, appears, where P(H)=pP(H) = p. Thus, YY is distributed as a geometric random variable parameterized by pp: YGeom(p)Y \sim \text{Geom}(p). We know that the pmf of YY is fY(y)=P(Y=y)=(1p)y1p,y=1,2,...f_Y(y) = P(Y = y) = (1-p)^{y-1}p, y = 1,2,.... In other words, P(Y=y)P(Y = y) is the product of the probability of y1y-1 failures and the probability of one success.

Let's calculate the expected value of YY using the summation equation we've used previously (take the result on faith):

E[Y]=yyfY(y)=1y(1p)y1p=1pE[Y] = \sum_y y f_Y(y) = \sum_1^\infty y(1-p)^{y-1}p = \frac{1}{p}

Now we are going to use double expectation and a standard one-step conditioning argument to compute E[Y]E[Y]. First, let's define X=1X = 1 if the first flip is HH and X=0X = 0 otherwise. Let's pretend that we have knowledge of the first flip. We don't really have this knowledge, but we do know that the first flip can either be heads or tails: P(X=1)=p,P(X=0)=1pP(X = 1) = p, P(X = 0) = 1 - p.

Let's remember the double expectation formula:

E[Y]=E[E(YX)]=xE(Yx)fX(x)E[Y] = E[E(Y|X)] = \sum_x E(Y|x)f_X(x)

What are the xx-values? XX can only equal 00 or 11, so:

E[Y]=E(YX=0)P(X=0)+E(YX=1)P(X=1)E[Y] = E(Y|X = 0)P(X = 0) + E(Y|X=1)P(X=1)

Now, if X = 0, the first flip was tails, and I have to start counting all over again. The expected number of additional flips I have to make before I see heads is E[Y]. However, I have already flipped once, and I flipped tails: that's what X = 0 means. So, the expected number of flips I need, given that I already flipped tails, is E[Y | X = 0] = 1 + E[Y]. What is P(X = 0)? It's just 1 - p. Thus:

E[YX=0]P(X=0)=(1+E[Y])(1p)E[Y|X = 0]P(X = 0) = (1 + E[Y])(1 - p)

Now, if X = 1, the first flip was heads. I won! Given that X = 1, the expected value of Y is one. If I know that I flipped heads on the first try, the expected number of trials before I flip heads is that one trial: E[Y | X = 1] = 1. What is P(X = 1)? It's just p. Thus:

E[YX=1]P(X=1)=(1)(p)=pE[Y|X = 1]P(X = 1) = (1)(p) = p

Let's solve for E[Y]E[Y]:

E[Y]=(1+E[Y])(1p)+pE[Y] = (1 + E[Y])(1 - p) + p
E[Y]=1+E[Y]ppE[Y]+pE[Y] = 1 + E[Y] -p -pE[Y] + p
E[Y]=1+E[Y]pE[Y]E[Y] = 1 + E[Y] - pE[Y]
pE[Y]=1;E[Y]=1ppE[Y] = 1; \quad E[Y] = \frac{1}{p}
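
As a sanity check, a short simulation (a sketch assuming NumPy; p = 0.3 is an arbitrary choice) confirms that the average number of flips lands near 1/p:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
# rng.geometric returns the number of Bern(p) trials up to and including
# the first success, matching our definition of Y.
flips = rng.geometric(p, size=1_000_000)
print(flips.mean())   # ~3.33
print(1 / p)          # 3.33...
```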

Computing Probabilities by Conditioning

Let AA be some event. We define the random variable Y=1Y=1 if AA occurs, and Y=0Y = 0 otherwise. We refer to YY as an indicator function of AA; that is, the value of YY indicates the occurrence of AA. The expected value of YY is given by:

E[Y] = \sum_y y f_Y(y)

Let's enumerate the yy-values:

E[Y]=0(P(Y=0))+1(P(Y=1))=P(Y=1)E[Y] = 0(P(Y = 0)) + 1(P(Y = 1)) = P(Y = 1)

What is P(Y=1)P(Y = 1)? Well, Y=1Y = 1 when AA occurs, so P(Y=1)=P(A)=E[Y]P(Y = 1) = P(A) = E[Y]. Indeed, the expected value of an indicator function is the probability of the corresponding event.

Similarly, for any random variable, XX, we have:

E[YX=x]=yyfY(yx)E[Y | X = x] = \sum_y y f_Y(y|x)

If we enumerate the yy-values, we have:

\begin{alignedat}{1} E[Y | X = x] & = 0 \cdot P(Y = 0|X= x) + 1 \cdot P(Y = 1|X = x) \\[2ex] & = P(Y = 1|X = x) \end{alignedat}

Since Y = 1 exactly when A occurs, P(Y = 1 | X = x) = P(A | X = x). Thus:

E[Y | X = x] = P(A|X = x)

Let's look at an implication of the above result. By definition:

P[A]=E[Y]=E[E(YX)]P[A] = E[Y] = E[E(Y | X)]

Using LOTUS:

P[A]=RE[YX=x]dFX(x)P[A] = \int_{\mathbb{R}} E[Y|X=x]dF_X(x)

Since we saw that E[YX=x]=P(AX=x)E[Y|X=x] = P(A|X=x), then:

P[A]=RP(AX=x)dFX(x)P[A] = \int_{\mathbb{R}} P(A|X=x)dF_X(x)

Theorem

The result above implies that, if XX and YY are independent, continuous random variables, then:

P(Y<X)=RP(Y<x)fX(x)dxP(Y < X) = \int_{\mathbb{R}} P(Y < x)f_X(x)dx

To prove, let A={Y<X}A = \{Y < X\}. Then:

P[A]=RP(AX=x)dFX(x)P[A] = \int_{\mathbb{R}} P(A|X=x)dF_X(x)

Substitute A={Y<X}A = \{Y < X\}:

P[A]=RP(Y<XX=x)dFX(x)P[A] = \int_{\mathbb{R}} P(Y < X|X=x)dF_X(x)

What's P(Y<XX=x)P(Y < X|X=x)? In other words, for a given X=xX = x, what's the probability that Y<XY < X? That's a long way of saying P(Y<x)P(Y < x):

P[A]=RP(Y<x)dFX(x)P[A] = \int_{\mathbb{R}} P(Y < x)dF_X(x)
P[A] = P[Y < X] = \int_{\mathbb{R}} P(Y < x)f_X(x)dx, \quad \text{since } dF_X(x) = f_X(x)dx

Example

Suppose we have two random variables, XExp(μ)X \sim \text{Exp}(\mu) and YExp(λ)Y \sim \text{Exp}(\lambda). Then:

P(Y<X)=RP(Y<x)fX(x)dxP(Y < X) = \int_{\mathbb{R}} P(Y < x)f_X(x)dx

Note that P(Y<x)P(Y < x) is the cdf of YY at xx: FY(x)F_Y(x). Thus:

P(Y<X)=RFY(x)fX(x)dxP(Y < X) = \int_{\mathbb{R}} F_Y(x)f_X(x)dx

Since XX and YY are both exponentially distributed, we know that they have the following pdf and cdf, by definition:

f(x;λ)=λeλx,x0f(x; \lambda) = \lambda e^{-\lambda x}, x \geq 0
F(x;λ)=1eλx,x0F(x; \lambda) = 1 - e^{-\lambda x}, x \geq 0

Let's substitute these values in, adjusting the limits of integration appropriately:

P(Y < X) = \int_0^\infty \left(1 - e^{-\lambda x}\right)\mu e^{-\mu x}dx

Let's rearrange:

P(Y<X)=μ0eμxeλxμxdxP(Y < X) = \mu \int_0^\infty e^{-\mu x} - e^{-\lambda x - \mu x} dx
P(Y<X)=μ[0eμxdx0eλxμxdx]P(Y < X) = \mu \left[\int_0^\infty e^{-\mu x} dx - \int_0^\infty e^{-\lambda x - \mu x} dx\right]

Let u_1 = -\mu x, so du_1 = -\mu dx; and let u_2 = -(\lambda + \mu)x, so du_2 = -(\lambda + \mu)dx. Substituting (and keeping the limits of integration in terms of x):

P(Y < X) = \mu \left[-\int \frac{e^{u_1}}{\mu} du_1 + \int \frac{e^{u_2}}{\lambda + \mu} du_2\right]

Now we can integrate and substitute back for x:

P(Y < X) = \mu \left[\frac{e^{u_2}}{\lambda + \mu} - \frac{e^{u_1}}{\mu} \right] = \mu \left[\frac{e^{-(\lambda + \mu) x}}{\lambda + \mu} - \frac{e^{-\mu x}}{\mu} \right]_0^\infty
P(Y<X)=μ[01λ+μ+1μ]P(Y < X) = \mu \left[0 - \frac{1}{\lambda + \mu} + \frac{1}{\mu} \right]
P(Y<X)=μ[1μ1λ+μ]P(Y < X) = \mu \left[\frac{1}{\mu} - \frac{1}{\lambda + \mu} \right]
P(Y<X)=μμμλ+μP(Y < X) = \frac{\mu}{\mu} - \frac{\mu}{\lambda + \mu}
P(Y<X)=λ+μλ+μμλ+μ=λλ+μP(Y < X) = \frac{\lambda + \mu}{\lambda + \mu} - \frac{\mu}{\lambda + \mu} = \frac{\lambda}{\lambda + \mu}

As it turns out, this result makes sense because X and Y correspond to the first arrivals of two independent Poisson processes with rates \mu and \lambda, respectively. For example, suppose that X corresponds to arrival times for men to a store, and Y corresponds to arrival times for women. If women are coming in at a rate of three per hour - \lambda = 3 - and men are coming in at a rate of nine per hour - \mu = 9 - then the probability that a woman arrives before a man is \lambda / (\lambda + \mu) = 3/12 = 1/4.
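
A quick Monte Carlo check of this closed form, as a sketch assuming NumPy (note that NumPy's exponential sampler is parameterized by the mean 1/rate, not the rate):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu = 3.0, 9.0                              # women's and men's rates
x = rng.exponential(1 / mu, size=1_000_000)     # X ~ Exp(mu): men
y = rng.exponential(1 / lam, size=1_000_000)    # Y ~ Exp(lambda): women
print((y < x).mean())      # ~0.25
print(lam / (lam + mu))    # 3/12 = 0.25
```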

Variance Decomposition

Just as we can use double expectation for the expected value of YY, we can express the variance of YY, Var(Y)\text{Var}(Y) in a similar fashion, which we refer to as variance decomposition:

Var(Y)=E[Var(YX)]+Var[E(YX)]\text{Var}(Y) = E[\text{Var}(Y|X)] + \text{Var}[E(Y|X)]

Proof

Let's start with the first term: E[Var(YX)]E[\text{Var}(Y|X)]. Remember the definition of variance, as the second central moment:

Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2

Thus, we can express E[Var(YX)]E[\text{Var}(Y|X)] as:

E[Var(YX)]=E[E[Y2X](E[YX])2]E[\text{Var}(Y|X)] = E[E[Y^2 | X] - (E[Y|X])^2]

Note that, since expectation is linear:

E[Var(YX)]=E[E[Y2X]]E[(E[YX])2]E[\text{Var}(Y|X)] = E[E[Y^2 | X]] - E[(E[Y|X])^2]

Notice the first expression on the right-hand side. That's a double expectation, and we know how to simplify that:

E[Var(YX)]=E[Y2]E[(E[YX])2],1.E[\text{Var}(Y|X)] = E[Y^2] - E[(E[Y|X])^2], \quad 1.

Now let's look at the second term in the variance decomposition: Var[E(YX)]\text{Var}[E(Y|X)]. Considering again the definition for variance above, we can transform this term:

\text{Var}[E(Y|X)] = E[(E[Y | X])^2] - (E[E[Y|X]])^2

In this equation, we again see a double expectation, quantity squared. So:

\text{Var}[E(Y|X)] = E[(E[Y| X])^2] - (E[Y])^2, \quad 2.

Remember the equation for variance decomposition:

Var(Y)=E[Var(YX)]+Var[E(YX)]\text{Var}(Y) = E[\text{Var}(Y|X)] + \text{Var}[E(Y|X)]

Let's plug in 11 and 22 for the first and second term, respectively:

\text{Var}(Y) = E[Y^2] - E[(E[Y|X])^2] + E[(E[Y | X])^2] - (E[Y])^2

Notice the cancellation of the two scary inner terms to reveal the definition for variance:

Var(Y)=E[Y2]E[Y]2=Var(Y)\text{Var}(Y) = E[Y^2] - E[Y]^2 = \text{Var}(Y)
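
To see the decomposition in action, here is a simulation sketch assuming NumPy (the toy model is ours): take X ~ Unif(0,1) and Y | X ~ Nor(X, 1), so that E[Var(Y|X)] = 1 and Var(E[Y|X]) = Var(X) = 1/12.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=1_000_000)
y = rng.normal(loc=x, scale=1.0)   # conditional mean X, conditional variance 1
print(y.var())                     # ~1.0833
print(1 + 1 / 12)                  # E[Var(Y|X)] + Var(E[Y|X]) = 13/12
```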

Covariance and Correlation

In this lesson, we are going to talk about independence, covariance, correlation, and some related results. Correlation shows up all over the place in simulation, from inputs to outputs to everywhere in between.

LOTUS in 2D

Suppose that h(X,Y)h(X,Y) is some function of two random variables, XX and YY. Then, via LOTUS, we know how to calculate the expected value, E[h(X,Y)]E[h(X,Y)]:

E[h(X,Y)]={xyh(x,y)f(x,y)if (X,Y) is discreteRRh(x,y)f(x,y)dxdyif (X,Y) is continuousE[h(X,Y)] = \left\{\begin{matrix} \sum_x \sum_y h(x,y)f(x,y) & \text{if (X,Y) is discrete} \\ \int_{\mathbb{R}} \int_{\mathbb{R}} h(x,y)f(x,y)dx dy & \text{if (X,Y) is continuous} \\ \end{matrix}\right.

Expected Value, Variance of Sum

Whether or not XX and YY are independent, the sum of the expected values equals the expected value of the sum:

E[X+Y]=E[X]+E[Y]E[X+Y] = E[X] + E[Y]

If XX and YY are independent, then the sum of the variances equals the variance of the sum:

Var(X+Y)=Var(X)+Var(Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)

Note that we need the equations for LOTUS in two dimensions to prove both of these theorems.

Aside: I tried to prove these theorems. It went terribly! Check out the proper proofs here.

Random Sample

Let's suppose we have a set of nn random variables: X1,...,XnX_1,...,X_n. This set is said to form a random sample from the pmf/pdf f(x)f(x) if all the variables are (i) independent and (ii) each XiX_i has the same pdf/pmf f(x)f(x).

We can use the following notation to refer to such a random sample:

X1,...,Xniidf(x)X_1,...,X_n \overset{\text{iid}}{\sim} f(x)

Note that "iid" means "independent and identically distributed", which is what (i) and (ii) mean, respectively, in our definition above.

Theorem

Given a random sample, X1,...,Xniidf(x)X_1,...,X_n \overset{\text{iid}}{\sim} f(x), the sample mean, Xnˉ\bar{X_n} equals the following:

Xnˉi=1nXin\bar{X_n} \equiv \sum_{i =1}^n \frac{X_i}{n}

Given the sample mean, the expected value of the sample mean is the expected value of any of the individual variables, and the variance of the sample mean is the variance of any of the individual variables divided by nn:

E[Xnˉ]=E[Xi];Var(Xnˉ)=Var(Xi)/nE[\bar{X_n}] =E[X_i]; \quad\text{Var}(\bar{X_n}) = \text{Var}(X_i) / n

We can observe that as nn increases, E[Xnˉ]E[\bar{X_n}] is unaffected, but Var(Xnˉ)\text{Var}(\bar{X_n}) decreases.

Covariance

Covariance is one of the most fundamental measures of non-independence between two random variables. The covariance between XX and YY, Cov(X,Y)\text{Cov}(X, Y) is defined as:

Cov(X,Y)E[(XE[X])(YE[Y])]\text{Cov}(X,Y) \equiv E[(X-E[X])(Y - E[Y])]

The right-hand side of this equation looks daunting, so let's see if we can simplify it. We can first expand the product:

\begin{alignedat}{1} & E[(X-E[X])(Y - E[Y])] = \\ & E[XY - XE[Y] - YE[X] + E[Y]E[X]] \end{alignedat}

Since expectation is linear, we can rewrite the right-hand side as a difference of expected values:

\begin{alignedat}{1} & E[(X-E[X])(Y - E[Y])] = \\ & E[XY] - E[XE[Y]] - E[YE[X]] + E[E[Y]E[X]] \end{alignedat}

Note that both E[X]E[X] and E[Y]E[Y] are just numbers: the expected values of the corresponding random variables. As a result, we can apply two principles here: E[aX]=aE[X]E[aX] = aE[X] and E[a]=aE[a] = a. Consider the following rearrangement:

\begin{alignedat}{1} & E[(X-E[X])(Y - E[Y])] = \\ & E[XY] - E[Y]E[X] - E[X]E[Y] + E[Y]E[X] \end{alignedat}

The last three terms are equal in magnitude, and together they sum to -E[Y]E[X]. Thus:

Cov(X,Y)E[(XE[X])(YE[Y])]=E[XY]E[Y]E[X]\begin{alignedat}{1} \text{Cov}(X,Y) & \equiv E[(X-E[X])(Y - E[Y])] \\[2ex] & = E[XY] - E[Y]E[X] \end{alignedat}

This equation is much easier to work with; namely, h(X,Y)=XYh(X,Y) = XY is a much simpler function than h(X,Y)=(XE[X])(YE[Y])h(X,Y) = (X-E[X])(Y - E[Y]) when it comes time to apply LOTUS.

Let's understand what happens when we take the covariance of XX with itself:

Cov(X,X)=E[XX]E[X]E[X]=E[X2](E[X])2=Var(X)\begin{alignedat}{1} \text{Cov}(X,X) & = E[X * X] - E[X]E[X] \\[2ex] & = E[X^2] - (E[X])^2 \\[2ex] & = \text{Var}(X) \end{alignedat}

Theorem

If XX and YY are independent random variables, then Cov(X,Y)=0\text{Cov}(X, Y) = 0. On the other hand, a covariance of 00 does not mean that XX and YY are independent.

For example, consider two random variables, XUnif(1,1)X \sim \text{Unif}(-1,1) and Y=X2Y = X^2. Since YY is a function of XX, the two random variables are dependent: if you know XX, you know YY. However, take a look at the covariance:

Cov(X,Y)=E[X3]E[X]E[X2]\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2]

What is E[X]E[X]? Well, we can integrate the pdf from 1-1 to 11, or we can understand that the expected value of a uniform random variable is the average of the bounds of the distribution. That's a long way of saying that E[X]=(1+1)/2=0E[X] =(-1 + 1) / 2 = 0.

Now, what is E[X3]E[X^3]? We can apply LOTUS:

E[X3]=11x3f(x)dxE[X^3] = \int_{-1}^1 x^3f(x)dx

What is the pdf of a uniform random variable? By definition, it's one over the difference of the bounds:

E[X^3] = \frac{1}{1-(-1)}\int_{-1}^1 x^3 dx = \frac{1}{2}\int_{-1}^1 x^3 dx

Let's integrate and evaluate:

E[X3]=12x4411=148(1)48=0E[X^3] = \frac{1}{2} \frac{x^4}{4}\Big|_{-1}^1 = \frac{1^4}{8} - \frac{(-1)^4}{8} = 0

Thus:

Cov(X,Y)=E[X3]E[X]E[X2]=0\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0

Just because the covariance between XX and YY is 00 does not mean that they are independent!
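
A two-line simulation makes the point concrete (a sketch assuming NumPy): Y is completely determined by X, yet the sample covariance is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=1_000_000)
y = x**2                     # Y is a deterministic function of X
print(np.cov(x, y)[0, 1])    # ~0.0, despite total dependence
```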

More Theorems

Suppose that we have two random variables, XX and YY, as well as two constants, aa and bb. We have the following theorem:

Cov(aX,bY)=abCov(X,Y)\text{Cov}(aX, bY) = ab\text{Cov}(X,Y)

Whether or not XX and YY are independent,

Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)
Var(XY)=Var(X)+Var(Y)2Cov(X,Y)\text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y) - 2\text{Cov}(X, Y)

Note that we looked at a theorem previously which gave an equation for the variance of X+YX + Y when both variables are independent: Var(X+Y)=Var(X)+Var(Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y). That equation was a special case of the theorem above, where Cov(X,Y)=0\text{Cov}(X,Y) = 0 as is the case between two independent random variables.

Correlation

The correlation between XX and YY, ρ\rho, is equal to:

ρCov(X,Y)Var(X)Var(Y)\rho \equiv \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}

Note that correlation is standardized covariance. In other words, for any XX and YY, 1ρ1-1 \leq \rho \leq 1.

If two variables are highly correlated, then ρ\rho will be close to 11. If two variables are highly negatively correlated, then ρ\rho will be close to 1-1. Two variables with low correlation will have a ρ\rho close to 00.

Example

Consider the following joint pmf:

f(x,y)X=2X=3X=4fY(y)Y=400.000.200.100.3Y=500.150.100.050.3Y=600.300.000.100.4fX(x)0.450.300.251\begin{array}{c|ccc|c} f(x,y) & X = 2 & X = 3 & X = 4 & f_Y(y) \\ \hline Y = 40 & 0.00 & 0.20 & 0.10 & 0.3 \\ Y = 50 & 0.15 & 0.10 & 0.05 & 0.3 \\ Y = 60 & 0.30 & 0.00 & 0.10 & 0.4 \\ \hline f_X(x) & 0.45 & 0.30 & 0.25 & 1 \\ \end{array}

For this pmf, XX can take values in {2,3,4}\{2, 3, 4\} and YY can take values in {40,50,60}\{40, 50, 60\}. Note the marginal pmfs along the table's right and bottom, and remember that all pmfs sum to one when calculated over all appropriate values.

What is the expected value of XX? Let's use fX(x)f_X(x):

E[X]=2(0.45)+3(0.3)+4(0.25)=2.8E[X] = 2(0.45) + 3(0.3) + 4(0.25) = 2.8

Now let's calculate the variance:

Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2
Var(X)=4(0.45)+9(0.3)+16(0.25)(2.8)2=0.66\text{Var}(X) = 4(0.45) + 9(0.3) + 16(0.25) - (2.8)^2 = 0.66

What is the expected value of YY? Let's use fY(y)f_Y(y):

E[Y]=40(0.3)+50(0.3)+60(0.4)=51E[Y] = 40(0.3) + 50(0.3) + 60(0.4) = 51

Now let's calculate the variance:

Var(Y)=E[Y2](E[Y])2\text{Var}(Y) = E[Y^2] - (E[Y])^2
\text{Var}(Y) = 1600(0.3) + 2500(0.3) + 3600(0.4) - (51)^2 = 69

If we want to calculate the covariance of XX and YY, we need to know E[XY]E[XY], which we can calculate using two-dimensional LOTUS:

E[XY]=xyxyf(x,y)E[XY] = \sum_x \sum_y xy f(x,y)
E[XY]=(2400.00)+(2500.15)+...+(4600.1)=140E[XY] = (2 * 40 * 0.00) + (2 * 50 * 0.15) + ... + (4 * 60 * 0.1) = 140

With E[XY]E[XY] in hand, we can calculate the covariance of XX and YY:

Cov(X,Y)=E[XY]E[X]E[Y]=140(2.851)=2.8\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 140 - (2.8 * 51) = -2.8

Finally, we can calculate the correlation:

ρ=Cov(X,Y)Var(X)Var(Y)\rho = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}
ρ=2.80.66(69)0.415\rho = \frac{-2.8}{\sqrt{0.66(69)}} \approx -0.415
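
The whole example fits in a few lines of code (a sketch assuming NumPy; the variable names are ours). The joint pmf is stored with rows indexed by Y and columns by X:

```python
import numpy as np

xs = np.array([2, 3, 4])
ys = np.array([40, 50, 60])
f = np.array([[0.00, 0.20, 0.10],    # rows: Y = 40, 50, 60
              [0.15, 0.10, 0.05],    # cols: X = 2, 3, 4
              [0.30, 0.00, 0.10]])

f_x, f_y = f.sum(axis=0), f.sum(axis=1)   # marginal pmfs
EX, EY = xs @ f_x, ys @ f_y               # 2.8, 51
VarX = xs**2 @ f_x - EX**2                # 0.66
VarY = ys**2 @ f_y - EY**2                # 69
EXY = ys @ f @ xs                         # sum over x, y of x*y*f(x,y) = 140
rho = (EXY - EX * EY) / np.sqrt(VarX * VarY)
print(rho)                                # ~-0.415
```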

Portfolio Example

Let's look at two different assets, S1S_1 and S2S_2, that we hold in our portfolio. The expected yearly returns of the assets are E[S1]=μ1E[S_1] = \mu_1 and E[S2]=μ2E[S_2] = \mu_2, and the variances are Var(S1)=σ12\text{Var}(S_1) = \sigma_1^2 and Var(S2)=σ22\text{Var}(S_2) = \sigma_2^2. The covariance between the assets is σ12\sigma_{12}.

A portfolio is just a weighted combination of assets, and we can define our portfolio, PP, as:

P=wS1+(1w)S2,w[0,1]P = wS_1 + (1 - w)S_2, \quad w \in [0,1]

The portfolio's expected value is the sum of the expected values of the assets times their corresponding weights:

E[P]=E[wS1+(1w)S2]E[P] = E[wS_1 + (1 - w)S_2]
E[P]=E[wS1]+E[(1w)S2]E[P] = E[wS_1] + E[(1 - w)S_2]
E[P]=wE[S1]+(1w)E[S2]E[P] = wE[S_1] + (1 - w)E[S_2]
E[P]=wμ1+(1w)μ2E[P] = w\mu_1 + (1-w)\mu_2

Let's calculate the variance of the portfolio:

Var(P)=Var(wS1+(1w)S2)\text{Var}(P) = \text{Var}(wS_1 + (1-w)S_2)

Remember how we express Var(X+Y)\text{Var}(X + Y):

Var(P)=Var(wS1)+Var((1w)S2)+2Cov(wS1,(1w)S2)\text{Var}(P) = \text{Var}(wS_1) + \text{Var}((1-w)S_2) + 2\text{Cov}(wS_1, (1-w)S_2)

Remember that Var(aX)=a2Var(X)\text{Var}(aX) = a^2\text{Var}(X) and Cov(aX,bY)=abCov(X,Y)\text{Cov}(aX, bY) = ab\text{Cov}(X,Y). Thus:

Var(P)=w2Var(S1)+(1w)2Var(S2)+2w(1w)Cov(S1,S2)\text{Var}(P) = w^2\text{Var}(S_1) + (1-w)^2\text{Var}(S_2) + 2w(1-w)\text{Cov}(S_1, S_2)

Finally, let's substitute in the appropriate variables:

Var(P)=w2σ12+(1w)2σ22+2w(1w)σ12\text{Var}(P) = w^2\sigma^2_1 + (1-w)^2\sigma^2_2 + 2w(1-w)\sigma_{12}

How might we optimize this portfolio? One thing we might want to optimize for is minimal variance: many people want their portfolios to have as little volatility as possible.

Let's recap. Given a function f(x)f(x), how do we find the xx that minimizes f(x)f(x)? We can take the derivative, f(x)f'(x), set it to 00 and then solve for xx. Let's apply this logic to Var(P)\text{Var}(P). First, we take the derivative with respect to ww:

ddwVar(P)=2wσ122(1w)σ22+2σ124wσ12\frac{d}{dw}\text{Var}(P) = 2w\sigma^2_1 - 2(1-w)\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}
ddwVar(P)=2wσ122σ22+2wσ22+2σ124wσ12\frac{d}{dw}\text{Var}(P) = 2w\sigma^2_1 - 2\sigma^2_2 +2w\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}

Then, we set the derivative equal to 00 and solve for ww:

0=2wσ122σ22+2wσ22+2σ124wσ120 = 2w\sigma^2_1 - 2\sigma^2_2 +2w\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}
0=wσ12σ22+wσ22+σ122wσ120 = w\sigma^2_1 - \sigma^2_2 +w\sigma^2_2 + \sigma_{12} - 2w\sigma_{12}
σ22σ12=wσ12+wσ222wσ12\sigma^2_2 - \sigma_{12} = w\sigma^2_1 +w\sigma^2_2 - 2w\sigma_{12}
σ22σ12=w(σ12+σ222σ12)\sigma^2_2 - \sigma_{12} = w(\sigma^2_1 +\sigma^2_2 - 2\sigma_{12})
σ22σ12σ12+σ222σ12=w\frac{\sigma^2_2 - \sigma_{12}}{\sigma^2_1 +\sigma^2_2 - 2\sigma_{12}} = w

Example

Suppose E[S1]=0.2E[S_1] = 0.2, E[S2]=0.1E[S_2] = 0.1, Var(S1)=0.2\text{Var}(S_1) = 0.2, Var(S2)=0.4\text{Var}(S_2) = 0.4, and Cov(S1,S2)=0.1\text{Cov}(S_1, S_2) = -0.1.

What value of ww maximizes the expected return of this portfolio? We don't even have to do any math: just allocate 100% of the portfolio to the asset with the higher expected return - S1S_1. Since we define our portfolio as wS1+(1w)S2wS_1 + (1 - w)S_2, the correct value for ww is 11.

What value of ww minimizes the variance? Let's plug and chug:

w=σ22σ12σ12+σ222σ12w = \frac{\sigma^2_2 - \sigma_{12}}{\sigma^2_1 +\sigma^2_2 - 2\sigma_{12}}
w=0.4+0.10.2+0.4+0.2=0.5/0.8=0.625w = \frac{0.4 + 0.1}{0.2 + 0.4 + 0.2} = 0.5 / 0.8 = 0.625

To minimize variance, we should hold a portfolio consisting of 5/85/8 S1S_1 and 3/83/8 S2S_2.
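
Plugging the example numbers into the minimum-variance formula (a plain-Python sketch; the helper name is ours) shows that the optimal mix is indeed less volatile than holding either asset alone:

```python
var1, var2, cov12 = 0.2, 0.4, -0.1

w = (var2 - cov12) / (var1 + var2 - 2 * cov12)
print(w)   # 0.625 = 5/8

def port_var(w):
    """Var(P) = w^2*var1 + (1-w)^2*var2 + 2w(1-w)*cov12."""
    return w**2 * var1 + (1 - w)**2 * var2 + 2 * w * (1 - w) * cov12

print(port_var(w), port_var(1.0), port_var(0.0))   # 0.0875 vs 0.2 and 0.4
```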

There are tradeoffs in any optimization. For example, optimizing for maximal expected return may introduce high levels of volatility into the portfolio. Conversely, optimizing for minimal variance may result in paltry returns.

Probability Distributions

In this lesson, we are going to review several popular discrete and continuous distributions.

Bernoulli (Discrete)

Suppose we have a random variable, XBernoulli(p)X \sim \text{Bernoulli}(p). XX has the following pmf:

f(x)={pif x=11p(=q)if x=0f(x) = \left\{\begin{matrix} p & \text{if } x = 1 \\ 1 - p (= q) & \text{if } x = 0 \end{matrix}\right.

Additionally, XX has the following properties:

E[X]=pE[X] = p
Var(X)=pq\text{Var}(X) = pq
MX(t)=pet+qM_X(t) = pe^t + q

Binomial (Discrete)

The Bernoulli distribution generalizes to the binomial distribution. Suppose we have nn iid Bernoulli random variables: X1,X2,...,XniidBern(p)X_1,X_2,...,X_n \overset{\text{iid}}\sim \text{Bern}(p). Each XiX_i takes on the value 11 with probability pp and 00 with probability 1p1-p. If we take the sum of the successes, we have the following random variable, YY:

Y=i=1nXiBin(n,p)Y = \sum_{i = 1}^n X_i \sim \text{Bin}(n,p)

YY has the following pmf:

f(y)=(ny)pyqny,y=0,1,...,n.f(y) = \binom{n}{y}p^yq^{n-y}, \quad y = 0, 1,...,n.

Notice the binomial coefficient in this equation. We read this as "n choose y", which is defined as:

\binom{n}{y} = \frac{n!}{y!(n-y)!}

What's going on here? First, what is the probability of y successes? Well, completely, it's the probability of y successes and n-y failures: p^yq^{n-y}. Of course, the outcome of y consecutive successes followed by n-y consecutive failures is just one particular arrangement of many. How many? n choose y. This is what the binomial coefficient expresses.

Additionally, YY has the following properties:

E[Y]=npE[Y] = np
Var(Y)=npq\text{Var}(Y) = npq
MY(t)=(pet+q)nM_Y(t) = (pe^t + q)^n

Note that the variance and the expected value are equal to nn times the variance and the expected value of the Bernoulli random variable. This relationship makes sense: a binomial random variable is the sum of nn Bernoulli's. The moment-generating function looks a little bit different. As it turns out, we multiply the moment-generating functions when we sum the random variables.

Geometric (Discrete)

Suppose we have a random variable, XGeometric(p)X \sim \text{Geometric}(p). A geometric random variable corresponds to the number of Bern(p)\text{Bern}(p) trials until a success occurs. For example, three failures followed by a success ("FFFS") implies that X=4X = 4. A geometric random variable has the following pmf:

f(x)=qx1p,x=1,2,...f(x) = q^{x-1}p, \quad x = 1,2,...

We can see that this equation directly corresponds to the probability of x1x - 1 failures, each with probability qq followed by one success, with probability pp.

Additionally, XX has the following properties:

E[X]=1pE[X] = \frac{1}{p}
Var(X)=qp2\text{Var}(X) = \frac{q}{p^2}
MX(t)=pet1qetM_X(t) = \frac{pe^t}{1-qe^t}

Negative Binomial (Discrete)

The geometric distribution generalizes to the negative binomial distribution. Suppose that we are interested in the number of trials it takes to see r successes. We can add r iid Geom(p) random variables to get the random variable Y \sim \text{NegBin}(r, p). For example, if r = 3, then a run of "FFFSSFS" implies that Y = 7. Y has the following pmf:

f(y)=(y1r1)qyrpr,y=r,r+1,...f(y) = \binom{y-1}{r-1}q^{y-r}p^{r}, \quad y = r, r + 1,...

Additionally, YY has the following properties:

E[Y]=rpE[Y] = \frac{r}{p}
Var(Y)=qrp2\text{Var}(Y) = \frac{qr}{p^2}

Note that the variance and the expected value are equal to rr times the variance and the expected value of the geometric random variable. This relationship makes sense: a negative binomial random variable is the sum of rr geometric random variables.

Poisson (Discrete)

A counting process, N(t)N(t) keeps track of the number of "arrivals" observed between time 00 and time tt. For example, if 77 people show up to a store by time t=3t=3, then N(3)=7N(3) = 7. A Poisson process is a counting process that satisfies the following criteria.

  1. Arrivals must occur one-at-a-time at a rate, λ\lambda. For example, λ=4/hr\lambda = 4 / \text{hr} means that, on average, arrivals occur every fifteen minutes, yet no two arrivals coincide.
  2. Disjoint time increments are independent. Suppose we are looking at arrivals on the intervals 12 am - 2 am and 5 am - 10 am. Independent increments means that the arrivals in the first interval don't impact arrivals in the second.
  3. Increments are stationary; in other words, the distribution of the number of arrivals in the interval [s,s+t][s, s + t] depends only on the interval's length, tt. It does not depend on where the interval starts, ss.

A random variable XPois(λ)X \sim \text{Pois}(\lambda) describes the number of arrivals that a Poisson process experiences in one time unit, i.e., N(1)N(1). XX has the following pmf:

f(x)=eλλxx!,x=0,1,...f(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0,1,...

Additionally, XX has the following properties:

E[X]=Var(X)=λE[X] = \text{Var}(X) = \lambda
MX(t)=eλ(et1)M_X(t) = e^{\lambda(e^t - 1)}

Uniform (Continuous)

A uniform random variable, XUniform(a,b)X \sim \text{Uniform}(a,b), has the following pdf:

f(x)=1ba,axbf(x) = \frac{1}{b - a}, \quad a \leq x \leq b

Additionally, XX has the following properties:

E[X]=a+b2E[X] = \frac{a + b}{2}
Var(X)=(ba)212\text{Var}(X) = \frac{(b-a)^2}{12}
MX(t)=etbetatbtaM_X(t) = \frac{e^{tb} - e^{ta}}{tb - ta}

Exponential (Continuous)

A continuous, exponential random variable XExponential(λ)X \sim \text{Exponential}(\lambda) has the following pdf:

f(x)=λeλx,x0f(x) = \lambda e^{-\lambda x}, \quad x \geq 0

Additionally, XX has the following properties:

E[X]=1λE[X] = \frac{1}{\lambda}
Var(X)=1λ2\text{Var}(X) = \frac{1}{\lambda^2}
MX(t)=λλt,t<λM_X(t) = \frac{\lambda}{\lambda - t}, \quad t < \lambda

The exponential distribution also has a memoryless property, which means that for s,t>0s, t > 0, P(X>s+tX>s)=P(X>t)P(X > s + t | X > s) = P(X > t). For example, if we have a light bulb, and we know that it has lived for ss time units, the conditional probability that it will live for s+ts + t time units (an additional tt units), is the unconditional probability that it will live for tt time units. Analogously, there is no "memory" of the prior ss time units.

Let's look at a concrete example. If XExp(1/100)X \sim \text{Exp}(1/100), then:

P(X > 200 | X > 50) = P(X > 150) = e^{-\lambda t} = e^{-150/100} \approx 0.2231
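
Here is a Monte Carlo sketch of that example, assuming NumPy (again, NumPy's sampler takes the mean 100 = 1/\lambda rather than the rate):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(100, size=2_000_000)   # X ~ Exp(1/100), mean 100
survivors = x[x > 50]                      # condition on X > 50
print((survivors > 200).mean())            # P(X > 200 | X > 50)
print(np.exp(-150 / 100))                  # P(X > 150) ~ 0.2231
```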

Gamma (Continuous)

Recall the gamma function, Γ(α)\Gamma(\alpha):

\Gamma(\alpha) \equiv \int_0^\infty t^{\alpha-1}e^{-t}dt

A gamma random variable, XGamma(α,λ)X \sim \text{Gamma}(\alpha, \lambda), has the following pdf:

f(x)=λαxα1eλxΓ(α),x0f(x) = \frac{\lambda^\alpha x^{\alpha - 1} e^{-\lambda x}}{\Gamma(\alpha)}, \quad x \geq 0

Additionally, XX has the following properties:

E[X]=αλE[X] = \frac{\alpha}{\lambda}
Var(X)=αλ2\text{Var}(X) = \frac{\alpha}{\lambda^2}
MX(t)=[λ/(λt)]α,t<λM_X(t) = \left[\lambda /(\lambda - t)\right]^\alpha, \quad t < \lambda

Note what happens when α=1\alpha = 1: the gamma random variable reduces to an exponential random variable. Another way to say this is that the gamma distribution generalizes the exponential distribution.

If X1,X2,...,XniidExp(λ)X_1, X_2,...,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda), then we can express a new random variable, YY:

Yi=1nXiGamma(n,λ)Y \equiv \sum_{i=1}^n X_i \sim \text{Gamma}(n, \lambda)

The Gamma(n,λ)\text{Gamma}(n,\lambda) distribution is also called the Erlangn(λ)\text{Erlang}_n(\lambda) distribution, and has the following cdf:

FY(y)=1eλyj=0n1(λy)jj!,y0F_Y(y) = 1- e^{-\lambda y}\sum_{j=0}^{n-1}\frac{(\lambda y)^j}{j!}, \quad y \geq0

Triangular (Continuous)

Suppose that we have a random variable, X \sim \text{Triangular}(a,b,c). The triangular distribution is parameterized by three values - the smallest possible observation, a, the "most likely" observation, b, and the largest possible observation, c - and is useful for models with limited data. X has the following pdf:

f(x)={2(xa)(ba)(ca)if a<xb2(cx)(cb)(ca)if b<xc0otherwisef(x) = \left\{\begin{matrix} \frac{2(x-a)}{(b-a)(c-a)} & \text{if } a < x \leq b \\ \frac{2(c - x)}{(c-b)(c-a)} & \text{if } b < x \leq c \\ 0 & \text{otherwise} \end{matrix}\right.

Additionally, XX has the following property:

E[X]=a+b+c3E[X] = \frac{a + b + c}{3}

Beta (Continuous)

Suppose we have a random variable, XBeta(a,b)X \sim \text{Beta}(a,b). XX has the following pdf:

f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}, \quad 0 \leq x \leq 1, \quad a, b > 0

Additionally, XX has the following properties:

E[X]=aa+bE[X] = \frac{a}{a+b}
Var(X)=ab(a+b)2(a+b+1)\text{Var}(X) = \frac{ab}{(a+b)^2(a+b+1)}

Normal (Continuous)

Suppose we have a random variable, XNormal(μ,σ2)X \sim \text{Normal}(\mu, \sigma^2). XX has the following pdf, which, when graphed, produces the famous bell curve:

f(x)=12πσ2exp[(xμ)22σ2],xR, where exp(x)=exf(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[\frac{-(x - \mu)^2}{2\sigma^2}\right], \quad x \in \mathbb{R}, \quad \text{ where } \exp(x) = e^x

Additionally, XX has the following properties:

E[X]=μVar(X)=σ2MX(t)=exp(μt+12σ2t2)\begin{alignedat}{1} E[X] &= \mu \\[2ex] \text{Var}(X) &= \sigma^2 \\[2ex] M_X(t) &= \exp\left(\mu t + \frac{1}{2}\sigma^2t^2\right) \end{alignedat}

Theorem

If XNor(μ,σ2)X \sim \text{Nor}(\mu, \sigma^2), then a linear function of XX is also normal: aX+bNor(aμ+b,a2σ2)aX + b \sim \text{Nor}\left(a\mu + b, a^2 \sigma^2\right). Interestingly, if we set a=1/σa = 1 / \sigma and b=μ/σb = -\mu / \sigma, then Z=aX+bZ = aX + b is:

ZXμσNor(0,1)Z \equiv \frac{X - \mu}{\sigma} \sim \text{Nor}(0, 1)

ZZ is drawn from the standard normal distribution, which has the following pdf:

ϕ(z)12πez2/2\phi(z) \equiv \frac{1}{\sqrt{2\pi}}e^{-z^2 / 2}

This distribution also has the cdf Φ(z)\Phi(z), which is tabled. For example, Φ(1.96)0.975\Phi(1.96) \doteq 0.975.

Theorem

If X1X_1 and X2X_2 are independent with XiNor(μi,σi2),i=1,2X_i \sim \text{Nor}(\mu_i,\sigma_i^2), i = 1,2, then X1+X2Nor(μ1+μ2,σ12+σ22)X_1 + X_2 \sim \text{Nor}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2).

For example, suppose that we have two random variables, XNor(3,4)X \sim \text{Nor}(3,4) and YNor(4,6)Y \sim \text{Nor}(4,6) and XX and YY are independent. How is Z=2X3Y+1Z = 2X - 3Y + 1 distributed? Let's first start with 2X2X:

2X \sim \text{Nor}(2 \cdot 3, 2^2 \cdot 4) = \text{Nor}(6, 16)

Now, let's find the distribution of 3Y+1-3Y + 1:

-3Y + 1 \sim \text{Nor}(-3 \cdot 4 + 1, (-3)^2 \cdot 6) = \text{Nor}(-11, 54)

Finally, we combine the two distributions linearly:

2X - 3Y + 1 \sim \text{Nor}(6 - 11, 16 + 54) = \text{Nor}(-5, 70)
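
A simulation sketch confirms the mean and variance of Z, assuming NumPy (note that np.random's normal sampler takes the standard deviation, not the variance):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(3, np.sqrt(4), size=1_000_000)   # X ~ Nor(3, 4)
y = rng.normal(4, np.sqrt(6), size=1_000_000)   # Y ~ Nor(4, 6)
z = 2 * x - 3 * y + 1
print(z.mean(), z.var())                        # ~-5, ~70
```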

Let's look at the corollary of a previous theorem. If X1,...,XnX_1, ..., X_n are iid Nor(μ,σ2)\text{Nor}(\mu, \sigma^2), then the sample mean XˉnNor(μ,σ2/n)\bar{X}_n \sim \text{Nor}(\mu, \sigma^2 / n). While the sample mean has the same mean as the distribution from which the observations were sampled, its variance has decreased by a factor of nn. As we get more information, the sample mean becomes less variable. This observation is a special case of the law of large numbers, which states that Xˉn\bar{X}_n approximates μ\mu as nn becomes large.

Limit Theorems

In this lesson, we are going to see what happens when we sample a large number of iid random variables. As we will see, normality happens, and, in particular, we will be looking at the central limit theorem.

Corollary (of a previous theorem)

If X1,...,XnX_1,..., X_n are iid Nor(μ,σ2)\text{Nor}(\mu, \sigma^2), then the sample mean, Xˉn\bar{X}_n, is distributed as Nor(μ,σ2/n)\text{Nor}(\mu, \sigma^2 / n). In other words:

\bar{X}_n \equiv \frac{1}{n}\sum_{i=1}^n X_i \sim \text{Nor}(\mu, \sigma^2 / n)

Notice that \bar{X}_n has the same mean as the distribution from which the observations were sampled, but its variance decreases by a factor of the number of samples, n. Said another way, the variability of \bar{X}_n decreases as n increases, driving \bar{X}_n toward \mu.

As we said before, this observation is a special case of the law of large numbers, which states that Xˉ\bar{X} approximates μ\mu as nn becomes large.

Definition

Suppose we have a sequence of random variables, Y_1, Y_2,..., with respective cdf's F_{Y_1}(y), F_{Y_2}(y),.... This sequence converges in distribution to the random variable Y having cdf F_Y(y) if \lim_{n \to \infty}F_{Y_n}(y) = F_Y(y) for all y at which F_Y(y) is continuous. We express this idea of converging in distribution with the following notation:

YndYY_n \overset{d}{\to} Y

How do we use this? Well, if YndYY_n \overset{d}{\to} Y, and nn is large, then we can approximate the distribution of YnY_n by the limit distribution of YY.

Central Limit Theorem

Suppose X1,X2,...,XnX_1, X_2,...,X_n are iid random variables sampled from some distribution with pdf/pmf f(x)f(x) having mean μ\mu and variance σ2\sigma^2. Let's define a random variable ZnZ_n:

Zni=1nXinμnσZ_n \equiv \frac{\sum_{i=1}^n X_i - n\mu}{\sqrt{n}\sigma}

We can simplify this expression. Let's first split the sum:

Z_n \equiv \frac{\sum_{i=1}^n X_i}{\sqrt{n}\sigma} - \frac{n\mu}{\sqrt{n}\sigma}

Let's work on the first term:

i=1nXinσ=ni=1nXinσ\begin{aligned} \frac{\sum_{i=1}^n X_i}{\sqrt{n}\sigma} = \frac{\sqrt{n}\sum_{i=1}^n X_i}{n\sigma} \\[2ex] \end{aligned}

Note that the sum of the random variables divided by n equals \bar{X}_n:

ni=1nXinσ=nXnˉσ\begin{aligned} \frac{\sqrt{n}\sum_{i=1}^n X_i}{n\sigma}= \frac{\sqrt{n}\bar{X_n}}{\sigma} \end{aligned}

Now, let's work on the second term, dividing nn by n\sqrt{n}:

μnnσ=nμσ\begin{aligned} \frac{\mu n}{\sqrt{n}\sigma} = \frac{\sqrt{n}\mu}{\sigma} \end{aligned}

Let's combine the two terms:

Zni=1nXinμnσ=nXnˉσnμσ=n(Xnˉμ)σ\begin{aligned} Z_n \equiv \frac{\sum_{i=1}^n X_i - n\mu}{\sqrt{n}\sigma} = \frac{\sqrt{n}\bar{X_n}}{\sigma} - \frac{\sqrt{n}\mu}{\sigma} \\[2ex] = \frac{\sqrt{n}(\bar{X_n} - \mu)}{\sigma} \end{aligned}

The expression above converges in distribution to a Nor(0,1)\text{Nor}(0, 1) distribution:

Znn(Xnˉμ)σdNor(0,1).Z_n \equiv \frac{\sqrt{n}(\bar{X_n} - \mu)}{\sigma} \overset{d}{\to} \text{Nor}(0, 1).

Thus, the cdf of Z_n approaches \Phi(z), the standard normal cdf, as n increases.

The central limit theorem works well if the pdf/pmf is fairly symmetric and the number of samples, nn, is greater than fifteen.

Example

Suppose that we have 100 observations, X_1, X_2,...,X_{100} \overset{\text{iid}}{\sim} \text{Exp}(1). Note that, with \lambda = 1, \mu = \sigma^2 = 1. What is the probability that the sum of all 100 random variables falls between 90 and 110?

P(90i=1100Xi110)=?P\left(90 \leq \sum_{i =1}^{100} X_i \leq 110 \right) = \quad ?

We can use the central limit theorem to approximate this probability. Let's apply the standardization z(x) = (x - n\mu) / \sqrt{n}\sigma to each side of the inequality, converting the sum of our observations to Z_{100}:

P(90i=1100Xi110)=P(90100100Z100110100100)P\left(90 \leq \sum_{i =1}^{100} X_i \leq 110 \right) = P\left(\frac{90-100}{\sqrt{100}} \leq Z_{100} \leq \frac{110 - 100}{\sqrt{100}} \right)

Since ZndNor(0,1)Z_n \overset{d}{\to} \text{Nor}(0,1), we can approximate that Z100Nor(0,1)Z_{100} \sim \text{Nor}(0,1):

P\left(\frac{90-100}{\sqrt{100}} \leq Z_{100} \leq \frac{110 - 100}{\sqrt{100}} \right) \approx P(-1 \leq Z \leq 1), \quad Z \sim \text{Nor}(0,1)

This final expression is asking for the probability that a Nor(0,1)\text{Nor}(0,1) random variable falls between 1-1 and 11. The standard normal distribution has a standard deviation of 11, and we know that the probability of falling within one standard deviation of the mean is 0.682768%0.6827 \approx 68\%.

Now, remember that the sum of iid exponential random variables is itself an Erlang random variable:

i=1100XiErlangk=100(λ)\sum_{i = 1}^{100} X_i \sim \text{Erlang}_{k=100}(\lambda)

Erlang random variables have a closed-form cdf, which we can evaluate - directly or through software such as Minitab - to obtain the exact value of the probability above:

P(90i=1100Xi110)=0.6835P\left(90 \leq \sum_{i =1}^{100} X_i \leq 110 \right) = 0.6835

Not bad.
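
Both numbers are easy to reproduce with a standard-library sketch; the Erlang cdf below is just the formula from the Gamma section coded directly:

```python
from math import erf, exp, factorial, sqrt

def erlang_cdf(y, n, lam):
    """F_Y(y) = 1 - exp(-lam*y) * sum_{j=0}^{n-1} (lam*y)^j / j!"""
    return 1 - exp(-lam * y) * sum((lam * y)**j / factorial(j) for j in range(n))

print(erlang_cdf(110, 100, 1.0) - erlang_cdf(90, 100, 1.0))   # ~0.6835 (exact)

def phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(phi(1) - phi(-1))                                       # ~0.6827 (CLT)
```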

Introduction to Estimation (OPTIONAL)

In this lesson, we are going to start our review of basic statistics. In particular, we are going to talk about unbiased estimation and mean squared error.

Statistic Definition

A statistic is a function of the observations X1,...,XnX_1,...,X_n that is not explicitly dependent on any unknown parameters. For example, the sample mean, Xˉ\bar{X}, and the sample variance, S2S^2, are two statistics:

\begin{alignedat}{1} \bar{X} = &\frac{1}{n} \sum_{i=1}^n X_i \\[2ex] S^2 = &\frac{1}{n-1} \sum_{i=1}^n(X_i - \bar{X})^2 \end{alignedat}

Statistics are random variables. In other words, if we take two different samples, we should expect to see two different values for a given statistic.

We usually use statistics to estimate some unknown parameter from the underlying probability distribution of the XiX_i's. For instance, we use the sample mean, Xˉ\bar{X}, to estimate the true mean, μ\mu, of the underlying distribution, which we won't normally know. If μ\mu is the true mean, then we can take a bunch of samples and use Xˉ\bar{X} to estimate μ\mu. We know, via the law of large numbers that, as nn \to \infty, Xˉμ\bar{X} \to \mu.

Point Estimator

Let's suppose that we have a collection of iid random variables, X1,...,XnX_1,...,X_n. Let T(X)T(X1,...,Xn)T(\bold{X}) \equiv T(X_1,...,X_n) be a function that we can compute based only on the observations. Therefore, T(X)T(\bold{X}) is a statistic. If we use T(X)T(\bold{X}) to estimate some unknown parameter θ\theta, then T(X)T(\bold{X}) is known as a point estimator for θ\theta.

For example, Xˉ\bar{X} is usually a point estimator for the true mean, μ=E[Xi]\mu = E[X_i], and S2S^2 is often a point estimator for the true variance, σ2=Var(X)\sigma^2 = \text{Var}(X).

T(X)T(\bold{X}) should have specific properties:

  • Its expected value should equal the parameter it's trying to estimate. This property is known as unbiasedness.
  • It should have a low variance. It doesn't do us any good if T(X)T(\bold{X}) is bouncing around depending on the sample we take.

Unbiasedness

We say that T(X)T(\bold{X}) is unbiased for θ\theta if E[T(X)]=θE[T(\bold{X})] = \theta. For example, suppose that random variables, X1,...,XnX_1,...,X_n are iid anything with mean μ\mu. Then:

E[Xˉ]=E[1ni=1nXi]=1ni=1nE[Xi]=nE[Xi]n=E[Xi]=μ\begin{alignedat}{1} E[\bar{X}] & = E\left[\frac{1}{n}\sum_{i=1}^nX_i\right] \\[3ex] & = \frac{1}{n}\sum_{i=1}^nE[X_i] \\[3ex] & = \frac{nE[X_i]}{n} \\[2ex] & = E[X_i] = \mu \end{alignedat}

Since E[Xˉ]=μE[\bar{X}] = \mu, Xˉ\bar{X} is always unbiased for μ\mu. That's why we call it the sample mean.

Similarly, suppose we have random variables, X1,...,XnX_1,...,X_n which are iid Exp(λ)\text{Exp}(\lambda). Then, Xˉ\bar{X} is unbiased for μ=E[Xi]=1/λ\mu = E[X_i] = 1 / \lambda. Even though λ\lambda is unknown, we know that Xˉ\bar{X} is a good estimator for 1/λ1/ \lambda.

Be careful, though. Just because Xˉ\bar{X} is unbiased for 1/λ1 / \lambda does not mean that 1/Xˉ1 / \bar{X} is unbiased for λ\lambda: E[1/Xˉ]1/E[Xˉ]=λE[1/\bar{X}] \neq 1 /E[\bar{X}] = \lambda. In fact, 1/Xˉ1/\bar{X} is biased for λ\lambda in this exponential case.

Here's another example. Suppose that random variables, X1,...,XnX_1,...,X_n are iid anything with mean μ\mu and variance σ2\sigma^2. Then:

E[S2]=E[i=1n(XiXˉ)2n1]=Var(Xi)=σ2E[S^2] = E\left[\frac{\sum_{i=1}^n(X_i - \bar{X})^2}{n - 1}\right] = \text{Var}(X_i) = \sigma^2

Since E[S2]=σ2E[S^2] = \sigma^2, S2S^2 is always unbiased for σ2\sigma^2. That's why we called it the sample variance.

For example, suppose random variables X1,...,XnX_1,...,X_n are iid Exp(λ)\text{Exp}(\lambda). Then S2S^2 is unbiased for Var(Xi)=1/λ2\text{Var}(X_i) = 1 / \lambda^2.

Let's give a proof for the unbiasedness of S2S^2 as an estimate for σ2\sigma^2. First, let's convert S2S^2 into a better form:

\begin{alignedat}{1} S^2 & = \frac{\sum_{i=1}^n(X_i - \bar{X})^2}{n - 1} \\[3ex] & = \frac{\sum_{i=1}^n \left(X_i^2 -2X_i\bar{X} + \bar{X}^2\right)}{n - 1} \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - \sum_{i=1}^n 2X_i\bar{X} + \sum_{i=1}^n \bar{X}^2\right) \end{alignedat}

Let's rearrange the middle sum:

i=1n2XiXˉ=2Xˉi=1nXi \sum_{i=1}^n 2X_i\bar{X} = 2\bar{X}\sum_{i=1}^n X_i

Remember that Xˉ\bar{X} represents the average of all the XiX_i's: Xi/n\sum X_i / n. Thus, if we just sum the XiX_i's and don't divide by nn, we have a quantity equal to nXˉn\bar{X}:

2Xˉi=1nXi=2XˉnXˉ=2nXˉ22\bar{X}\sum_{i=1}^n X_i = 2\bar{X}n\bar{X} = 2n\bar{X}^2

Now, back to action:

S2=1n1(i=1nXi2i=1n2XiXˉ+i=1nXˉ2)=1n1(i=1nXi22nXˉ2+nXˉ2)=1n1(i=1nXi2nXˉ2)=i=1nXi2nXˉ2n1\begin{alignedat}{1} S^2 & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - \sum_{i=1}^n 2X_i\bar{X} + \sum_{i=1}^n \bar{X}^2\right) \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - 2n\bar{X}^2 + n\bar{X}^2\right) \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - n\bar{X}^2 \right) \\[3ex] & = \frac{\sum_{i=1}^n X_i^2 - n\bar{X}^2}{n-1} \\[3ex] \end{alignedat}

Let's take the expected value:

E[S2]=i=1nE[Xi2]nE[Xˉ2]n1\begin{alignedat}{1} E[S^2] & = \frac{\sum_{i=1}^n E[X_i^2] - nE[\bar{X}^2]}{n-1} \end{alignedat}

Note that E[Xi2]E[X_i^2] is the same for all XiX_i, so the sum is just nE[X12]nE[X_1^2]:

\begin{alignedat}{1} E[S^2] & = \frac{n E[X_1^2] - nE[\bar{X}^2]}{n-1} \\[2ex] & = \frac{n}{n-1} \left(E[X_1^2] - E[\bar{X}^2]\right) \end{alignedat}

We know that Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2, so E[X2]=Var(X)+(E[X])2E[X^2] = \text{Var}(X) + (E[X])^2. Therefore:

\begin{alignedat}{1} E[S^2] & = \frac{n}{n-1} \left(\text{Var}(X_1) + (E[X_1])^2 - \text{Var}(\bar{X}) - (E[\bar{X}])^2\right) \end{alignedat}

Remember that E[X1]=E[Xˉ]E[X_1] = E[\bar{X}], so:

\begin{alignedat}{1} E[S^2] & = \frac{n}{n-1} \left(\text{Var}(X_1) - \text{Var}(\bar{X}) \right) \end{alignedat}

Furthermore, remember that \text{Var}(\bar{X}) = \text{Var}(X_1) / n = \sigma^2 / n. Therefore:

\begin{alignedat}{1} E[S^2] & = \frac{n}{n-1} \left(\sigma^2 - \sigma^2/n \right) \\[3ex] & = \frac{n\sigma^2 - \sigma^2}{n-1} = \frac{\sigma^2(n-1)}{n-1} = \sigma^2 \end{alignedat}

Unfortunately, while S2S^2 is unbiased for the variance σ2\sigma^2, SS is biased for the standard deviation σ\sigma.
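A quick simulation illustrates both claims (a sketch assuming NumPy; the normal population with \sigma^2 = 4 is an arbitrary choice): the average S^2 lands on \sigma^2, while the average S falls short of \sigma.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0, 2, size=(200_000, 10))   # 200k samples of size n = 10
s2 = x.var(axis=1, ddof=1)                 # sample variance (n - 1 divisor)
print(s2.mean())                           # ~4.0: S^2 unbiased for sigma^2
print(np.sqrt(s2).mean())                  # ~1.95 < 2: S biased low for sigma
```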

Example

Suppose that we have nn random variables, X1,...,XniidUnif(0,θ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Unif}(0,\theta). We know that our observations have a lower bound of 00, but we don't know the value of the upper bound, θ\theta. As is often the case, we sample many observations from the distribution and use that sample to estimate the unknown parameter.

Consider two estimators for θ\theta:

\begin{alignedat}{1} Y_1 &\equiv 2\bar{X} \\[2ex] Y_2 &\equiv \frac{n+1}{n} \max_{1 \leq i \leq n}X_i \end{alignedat}

Let's look at the first estimator. We know that E[Y1]=2E[Xˉ]E[Y_1] = 2E[\bar{X}], by definition. Similarly, we know that 2E[Xˉ]=2E[Xi]2E[\bar{X}] = 2E[X_i], since Xˉ\bar{X} is always unbiased for the mean. Recall how we compute the expected value for a uniform random variable:

E[A] = (a + b) / 2, \quad A \sim \text{Unif}(a,b)

Therefore:

2E[X_i] = 2\left(\frac{0 + \theta}{2}\right) = \theta = E[Y_1]

As we can see, Y1Y_1 is unbiased for θ\theta.

It's also the case that Y2Y_2 is unbiased, but it takes more work to demonstrate. As a first step, take the cdf of the maximum of the XiX_i's, MmaxiXiM \equiv \max_iX_i. Here's what P(My)P(M \leq y) looks like:

P(My)=P(X1y and X2y and  and Xny)P(M \leq y) = P(X_1 \leq y \text{ and } X_2 \leq y \text{ and } \dots \text{ and } X_n \leq y)

If MyM \leq y, and MM is the maximum, then P(My)P(M \leq y) is the probability that all the XiX_i's are less than yy. Since the XiX_i's are independent, we can take the product of the individual probabilities:

P(M \leq y) = \prod_{i = 1}^n P(X_i \leq y) = [P(X_1 \leq y)]^n

Now, we know, by definition, that the cdf is the integral of the pdf. Remember that the pdf for a uniform distribution, Unif(a,b)\text{Unif}(a,b), is:

f(x)=1ba,a<x<bf(x) = \frac{1}{b-a}, a < x < b

Let's rewrite P(My)P(M \leq y):

P(My)=[0yfX1(x)dx]n=[0y1θdx]n=(yθ)n\begin{alignedat}{1} P(M \leq y) &= \left[\int_0^yf_{X_1}(x)dx\right]^n \\[2ex] & = \left[\int_0^y\frac{1}{\theta}dx\right]^n \\[2ex] & = \left(\frac{y}{\theta}\right)^n \end{alignedat}

Again, we know that the pdf is the derivative of the cdf, so:

\begin{alignedat}{1} f_M(y) & = \frac{d}{dy}F_M(y) \\[2ex] & = \frac{d}{dy} \frac{y^n}{\theta^n} \\[2ex] & = \frac{ny^{n-1}}{\theta^n} \end{alignedat}

With the pdf in hand, we can get the expected value of MM:

E[M]=0θyfM(y)dy=0θnynθndy=nyn+1(n+1)θn0θ=nθn+1(n+1)θn=nθn+1\begin{alignedat}{1} E[M] & = \int_0^\theta yf_M(y)dy \\[2ex] & = \int_0^\theta \frac{ny^n}{\theta^n} dy \\[2ex] & = \frac{ny^{n+1}}{(n+1)\theta^n}\Big|_0^\theta \\[2ex] & = \frac{n\theta^{n+1}}{(n+1)\theta^n} \\[2ex] & = \frac{n\theta}{n+1} \\[2ex] \end{alignedat}

Note that E[M]θE[M] \neq \theta, so MM is not an unbiased estimator for θ\theta. However, remember how we defined Y2Y_2:

Y_2 \equiv \frac{n+1}{n} \max_{1 \leq i \leq n}X_i

Thus:

E[Y2]=n+1nE[M]=n+1n(nθn+1)=θ\begin{alignedat}{1} E[Y_2] & = \frac{n+1}{n}E[M] \\[2ex] & = \frac{n+1}{n} \left(\frac{n\theta}{n+1}\right) \\[2ex] & = \theta \end{alignedat}

Therefore, Y2Y_2 is unbiased for θ\theta.

Both estimators are unbiased, so which is better? Let's compare variances. After similar algebra, we see:

Var(Y1)=θ23n,Var(Y2)=θ2n(n+2)\text{Var}(Y_1) = \frac{\theta^2}{3n}, \quad \text{Var}(Y_2) = \frac{\theta^2}{n(n+2)}

Since the variance of Y_2 involves dividing by n(n+2), while the variance of Y_1 only divides by 3n, Y_2 has a much lower variance than Y_1 (for any n > 1) and is, therefore, the better estimator.
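
The following simulation sketch (assuming NumPy, with \theta = 10 and n = 20 chosen arbitrarily) shows both estimators centering on \theta while Y_2's variance is far smaller:

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 10.0, 20, 100_000
x = rng.uniform(0, theta, size=(reps, n))

y1 = 2 * x.mean(axis=1)               # Y1 = 2 * Xbar
y2 = (n + 1) / n * x.max(axis=1)      # Y2 = (n+1)/n * max X_i

print(y1.mean(), y1.var())   # ~10, ~theta^2/(3n)     = 1.667
print(y2.mean(), y2.var())   # ~10, ~theta^2/(n(n+2)) = 0.227
```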

Bias and Mean Squared Error

The bias of an estimator, T(\bold{X}), is the difference between the estimator's expected value and the value of the parameter it's trying to estimate: \text{Bias}(T) \equiv E[T] - \theta. When E[T] = \theta, the bias is 0 and the estimator is unbiased.

The mean squared error of an estimator, T(\bold{X}), is the expected value of the squared deviation of the estimator from the parameter: \text{MSE}(T) \equiv E[(T-\theta)^2].

Remember the equation for variance:

Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2

Using this equation, we can rewrite MSE(T)\text{MSE}(T):

\begin{alignedat}{1} \text{MSE}(T) & = E[(T-\theta)^2] \\[2ex] & = \text{Var}(T - \theta) + (E[T - \theta])^2 \\[2ex] & = \text{Var}(T) + (E[T] - \theta)^2 \\[2ex] & = \text{Var}(T) + \text{Bias}(T)^2 \end{alignedat}

Usually, we use mean squared error to evaluate estimators. As a result, when selecting between multiple estimators, we might prefer a biased estimator over an unbiased one, so long as the biased estimator's MSE is the lowest among the options.

Relative Efficiency

The relative efficiency of one estimator, T1T_1, to another, T2T_2, is the ratio of the mean squared errors: MSE(T1)/MSE(T2)\text{MSE}(T_1) / \text{MSE}(T_2). If the relative efficiency is less than one, we want T1T_1; otherwise, we want T2T_2.

Let's compute the relative efficiency of the two estimators we used in the previous example:

\begin{alignedat}{1} Y_1 &\equiv 2\bar{X} \\[2ex] Y_2 &\equiv \frac{n+1}{n} \max_{1 \leq i \leq n}X_i \end{alignedat}

Remember that both estimators are unbiased, so the bias is zero by definition. As a result, the mean squared errors of the two estimators are determined solely by the variances:

MSE(Y1)=Var(Y1)=θ23n,MSE(Y2)=Var(Y2)=θ2n(n+2)\text{MSE}(Y_1) = \text{Var}(Y_1) = \frac{\theta^2}{3n}, \quad \text{MSE}(Y_2) = \text{Var}(Y_2) = \frac{\theta^2}{n(n+2)}

Let's calculate the relative efficiency:

e(Y1,Y2)=θ23nθ2n(n+2)=θ2n(n+2)3nθ2=n(n+2)3n=n+23\begin{alignedat}{1} e(Y_1, Y_2) & = \frac{\frac{\theta^2}{3n}}{\frac{\theta^2}{n(n+2)}} \\[3ex] & = \frac{\theta^2 * n(n+2)}{3n\theta^2} \\[2ex] & = \frac{n(n+2)}{3n} \\[2ex] & = \frac{n+2}{3} \\[2ex] \end{alignedat}

The relative efficiency is greater than one for all n>1n > 1, so Y2Y_2 is the better estimator just about all the time.
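
A two-line sketch makes the growth plain; the sample sizes below are arbitrary:

```python
# e(Y1, Y2) = (n + 2) / 3 grows linearly in n, so Y2's advantage widens
# as the sample size increases.
for n in (2, 10, 100):
    print(n, (n + 2) / 3)   # 2 -> 1.33..., 10 -> 4.0, 100 -> 34.0
```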

Maximum Likelihood Estimation (OPTIONAL)

In this lesson, we are going to talk about maximum likelihood estimation, which is perhaps the most important point estimation method. It's a very flexible technique that many software packages use to help estimate parameters from various distributions.

Likelihood Function and Maximum Likelihood Estimator

Consider an iid random sample, X1,...,XnX_1,...,X_n, where each XiX_i has pdf/pmf f(x)f(x). Additionally, suppose that θ\theta is some unknown parameter from XiX_i that we would like to estimate. We can define a likelihood function, L(θ)L(\theta) as:

L(θ)i=1nf(xi)L(\theta) \equiv \prod_{i=1}^n f(x_i)

The maximum likelihood estimator (MLE) of θ\theta is the value of θ\theta that maximizes L(θ)L(\theta). The MLE is a function of the XiX_i's and is itself a random variable.

Example

Consider a random sample, X1,...,XniidExp(λ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda). Find the MLE for λ\lambda. Note that, in this case, λ\lambda is taking the place of the abstract parameter, θ\theta. Now:

L(λ)i=1nf(xi)L(\lambda) \equiv \prod_{i=1}^n f(x_i)

We know that exponential random variables have the following pdf:

f(x,λ)=λeλxf(x, \lambda) = \lambda e^{-\lambda x}

Therefore:

L(λ)i=1nλeλxiL(\lambda) \equiv \prod_{i=1}^n \lambda e^{-\lambda x_i}

Let's expand the product:

L(λ)=(λeλx1)(λeλx2)(λeλxn)L(\lambda) = (\lambda e^{-\lambda x_1}) * (\lambda e^{-\lambda x_2}) * \dots * (\lambda e^{-\lambda x_n})

We can pull out a λn\lambda^n term:

L(λ)=λn(eλx1)(eλx2)(eλxn)L(\lambda) = \lambda^n * (e^{-\lambda x_1}) * (e^{-\lambda x_2}) * \dots * (e^{-\lambda x_n})

Remember what happens to exponents when we multiply bases:

axay=ax+ya^x * a^y = a^{x+y}

Let's apply this to our product (and we can swap in exp\exp notation to make things easier to read):

L(λ)=λnexp[λi=1nxi]L(\lambda) = \lambda^n \exp\left[-\lambda \sum_{i=1}^nx_i\right]

Now, we need to maximize L(λ)L(\lambda) with respect to λ\lambda. We could take the derivative of L(λ)L(\lambda), but we can use a trick! Since the natural log function is one-to-one, the λ\lambda that maximizes L(λ)L(\lambda) also maximizes ln(L(λ))\ln(L(\lambda)). Let's take the natural log of L(λ)L(\lambda):

ln(L(λ))=ln(λnexp[λi=1nxi])\ln(L(\lambda)) = \ln\left(\lambda^n \exp\left[-\lambda \sum_{i=1}^nx_i\right]\right)

Let's remember three different rules here:

ln(ab)=ln(a)+ln(b),ln(ab)=bln(a),ln(ex)=x\ln(ab) = \ln(a) + \ln(b), \quad \ln(a^b) = b\ln(a), \quad \ln(e^x) = x

Therefore:

ln(L(λ))=ln(λn)+ln(exp[λi=1nxi])=nln(λ)+ln(exp[λi=1nxi])=nln(λ)λi=1nxi\begin{alignedat}{1} \ln(L(\lambda)) & = \ln(\lambda^n) +\ln\left(\exp\left[-\lambda \sum_{i=1}^nx_i\right]\right) \\[2ex] & = n\ln(\lambda) +\ln\left(\exp\left[-\lambda \sum_{i=1}^nx_i\right]\right) \\[2ex] & = n\ln(\lambda) -\lambda \sum_{i=1}^nx_i \\[2ex] \end{alignedat}

Now, let's take the derivative:

ddλln(L(λ))=ddλ(nln(λ)λi=1nxi)=nλi=1nxi\begin{alignedat}{1} \frac{d}{d\lambda}\ln(L(\lambda)) & = \frac{d}{d\lambda}\left(n\ln(\lambda) -\lambda \sum_{i=1}^nx_i\right) \\[2ex] & = \frac{n}{\lambda} - \sum_{i=1}^nx_i \\[2ex] \end{alignedat}

Finally, we set the derivative equal to 00 and solve for λ\lambda:

0nλi=1nxinλ=i=1nxiλ=ni=1nxiλ=1Xˉ\begin{alignedat}{1} 0 \equiv \frac{n}{\lambda} & - \sum_{i=1}^nx_i \\[2ex] \frac{n}{\lambda} & = \sum_{i=1}^nx_i \\[3ex] \lambda & = \frac{n}{\sum_{i=1}^nx_i} \\[2ex] \lambda & = \frac{1}{\bar{X}} \\[2ex] \end{alignedat}

Thus, the maximum likelihood estimator for λ\lambda is 1/Xˉ1 / \bar{X}, which makes a lot of sense. The mean of the exponential distribution is 1/λ1 / \lambda, and we usually estimate that mean by Xˉ\bar{X}. Since Xˉ\bar{X} is a good estimator for 1/λ1 / \lambda, it stands to reason that a good estimator for λ\lambda is 1/Xˉ1 / \bar{X}.

Conventionally, we put a "hat" over the λ\lambda that maximizes the likelihood function to indicate that it is the MLE. Such notation looks like this: λ^\hat{\lambda}.

Note that we went from "little x's", xix_i, to "big x", Xˉ\bar{X}, in the equation. We do this to indicate that λ^\hat{\lambda} is a random variable.

Just to be careful, we probably should have performed a second-derivative test on the function, ln(L(λ))\ln(L(\lambda)), to ensure that we found a maximum likelihood estimator and not a minimum likelihood estimator.
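
As a sanity check, here is a hedged Python sketch (my own) that maximizes the log-likelihood numerically and confirms that the maximizer agrees with the closed form 1/Xˉ1/\bar{X}; the true rate, sample size, and search bounds are arbitrary assumptions:

```python
# Numerically maximize the exponential log-likelihood and compare the
# maximizer against the closed-form MLE, 1 / x-bar.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
x = rng.exponential(scale=1 / 2.0, size=500)    # Exp(lambda = 2) sample

def neg_log_lik(lam):
    # -ln L(lambda) = -(n ln(lambda) - lambda * sum(x_i))
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, 1 / x.mean())                      # numeric MLE vs. 1 / x-bar
```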

Invariance Property of MLE's

If θ^\hat{\theta} is the MLE of some parameter, θ\theta, and h()h(\cdot) is a 1:1 function, then h(θ^)h(\hat{\theta}) is the MLE of h(θ)h(\theta).

For example, suppose we have a random sample, X1,...,XniidExp(λ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda). The survival function, Fˉ(x)\bar{F}(x), is:

Fˉ(x)=P(X>x)=1F(x)=1(1eλx)=eλx\bar{F}(x) = P(X > x) = 1 - F(x) = 1 - (1 - e^{-\lambda x}) = e^{-\lambda x}

In addition, we saw that the MLE for λ\lambda is λ^=1/Xˉ\hat{\lambda} = 1 / \bar{X}. Therefore, using the invariance property, we obtain the MLE of Fˉ(x)\bar{F}(x) by plugging λ^\hat{\lambda} in for λ\lambda:

Fˉ^(x)=eλ^x=ex/Xˉ\hat{\bar{F}}(x) = e^{-\hat{\lambda}x} = e^{-x / \bar{X}}
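
Here is a short sketch (my own illustration) of the invariance property in action: we plug λ^=1/Xˉ\hat{\lambda} = 1/\bar{X} into the survival function and compare the result against the empirical survival fraction at an arbitrary point:

```python
# Invariance in action: the plug-in survival estimate e^{-lambda-hat * x0}
# should track the empirical fraction of observations exceeding x0.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.0, size=2000)   # Exp(lambda = 2) sample

lam_hat = 1 / x.mean()                          # MLE of lambda
x0 = 0.5                                        # arbitrary evaluation point
print(np.exp(-lam_hat * x0))                    # MLE of F-bar(x0)
print(np.mean(x > x0))                          # empirical P(X > x0)
```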

Confidence Intervals (OPTIONAL)

In this lesson, we are going to expand on the idea of point estimators for parameters and look at confidence intervals. We will use confidence intervals throughout this course, especially when we look at output analysis.

Important Distributions

Suppose we have a random sample, Z1,Z2,...,ZkZ_1, Z_2,...,Z_k, that is iid Nor(0,1)\text{Nor}(0,1). Then, Y=i=1kZi2Y = \sum_{i=1}^k Z_i^2 has the χ2\chi^2 distribution with kk degrees of freedom. We can express YY as Yχ2(k)Y \sim \chi^2(k), and YY has an expected value of kk and a variance of 2k2k.

If ZNor(0,1),Yχ2(k)Z \sim \text{Nor}(0,1), Y \sim \chi^2(k), and ZZ and YY are independent, then T=Z/Y/kT = Z / \sqrt{Y /k} has the Student's t distribution with kk degrees of freedom. We can express TT as Tt(k)T \sim t(k). Note that t(1)t(1) is the Cauchy distribution.

If Y1χ2(m),Y2χ2(n)Y_1 \sim \chi^2(m), Y_2 \sim \chi^2(n), and Y1Y_1 and Y2Y_2 are independent, then F=(Y1/m)/(Y2/n)F = (Y_1 /m)/(Y_2/n) has the F distribution with mm and nn degrees of freedom. Notation: FF(m,n)F \sim F(m,n).

We can use these distributions to construct confidence intervals for μ\mu and σ2\sigma^2 under a variety of assumptions.
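
To make these definitions concrete, here is a minimal sketch (my own, not from the lecture) that builds all three distributions from standard normals exactly as defined above and checks a few of the stated moments; kk, mm, and nn are arbitrary choices:

```python
# Construct chi^2, t, and F variates from standard normals, then check
# the stated moments by simulation.
import numpy as np

rng = np.random.default_rng(3)
k, m, n, reps = 5, 4, 6, 200_000

Y = (rng.standard_normal((reps, k)) ** 2).sum(axis=1)        # chi^2(k)
T = rng.standard_normal(reps) / np.sqrt(Y / k)               # t(k); fresh Z independent of Y
F = ((rng.standard_normal((reps, m)) ** 2).sum(axis=1) / m) / \
    ((rng.standard_normal((reps, n)) ** 2).sum(axis=1) / n)  # F(m, n)

print(Y.mean(), Y.var())   # roughly k = 5 and 2k = 10
print(T.var())             # roughly k / (k - 2) = 5/3 for t(5)
print(F.mean())            # roughly n / (n - 2) = 1.5 for F(4, 6)
```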

Confidence Interval

A 100(1α)%100(1-\alpha)\% two-sided confidence interval for an unknown parameter, θ\theta, is a random interval, [L,U][L, U], such that P(LθU)=1αP(L \leq \theta \leq U) = 1 - \alpha.

Here is an example of a confidence interval. We might say that we believe that the proportion of people voting for some candidate is between 47% and 51%, and we believe that statement to be true with probability 0.95.

Common Confidence Intervals

Suppose that we know the variance of a distribution, σ2\sigma^2. Then a 100(1α)%100(1-\alpha)\% confidence interval for μ\mu is:

Xˉnzα/2σ2nμXˉn+zα/2σ2n\bar{X}_n - z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}} \leq \mu \leq \bar{X}_n + z_{\alpha/2}\sqrt{\frac{\sigma^2}{n}}

Notice the structure of the inequality here: XˉnHμXˉn+H\bar{X}_n - H \leq \mu \leq \bar{X}_n + H. The expression HH, known as the half-length, is almost always a product of a constant and the square root of S2/nS^2 /n. In this rare case, since we know σ2\sigma^2, we can use it directly.

What is zz? zγz_\gamma is the 1γ1 - \gamma quantile of the standard normal distribution, which we can look up:

zγΦ1(1γ)z_\gamma \equiv \Phi^{-1}(1-\gamma)
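
Here is a minimal sketch of the known-variance interval; the data, σ2\sigma^2, and α\alpha below are hypothetical assumptions for illustration:

```python
# Known-variance normal confidence interval for the mean.
import numpy as np
from scipy.stats import norm

x = np.array([9.2, 10.1, 9.8, 10.4, 9.5, 10.0])  # hypothetical data
sigma2, alpha = 0.25, 0.05                       # assumed known variance, alpha

z = norm.ppf(1 - alpha / 2)                      # z_{alpha/2}, about 1.96
half = z * np.sqrt(sigma2 / len(x))              # the half-length H
print(x.mean() - half, x.mean() + half)          # the 95% CI for mu
```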

Let's look at another example. If σ2\sigma^2 is unknown, then a 100(1α)%100(1-\alpha)\% confidence interval for μ\mu is:

Xˉntα/2,n1S2nμXˉn+tα/2,n1S2n\bar{X}_n - t_{\alpha/2,n-1}\sqrt{\frac{S^2}{n}} \leq \mu \leq \bar{X}_n + t_{\alpha/2,n-1}\sqrt{\frac{S^2}{n}}

This example looks very similar to the previous example except that we are using S2S^2 (because we don't know σ2\sigma^2), and we are using a tt quantile instead of a normal quantile. Note that tγ,νt_{\gamma, \nu} is the 1γ1 - \gamma quantile of the t(ν)t(\nu) distribution.

Finally, let's look at the 100(1α)%100(1-\alpha)\% confidence interval for σ2\sigma^2:

(n1)S2χα2,n12σ2(n1)S2χ1α2,n12\frac{(n-1)S^2}{\chi^2_{\frac{\alpha}{2}, n-1}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{1- \frac{\alpha}{2}, n-1}}

Note that χγ,ν2\chi^2_{\gamma, \nu} is the 1γ1 - \gamma quantile of the χ2(ν)\chi^2(\nu) distribution.
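
Here is a corresponding sketch for the σ2\sigma^2 interval (same hypothetical data; note that, under our quantile convention, χγ,ν2\chi^2_{\gamma, \nu} corresponds to scipy's ppf(1 - gamma)):

```python
# Chi-square confidence interval for the variance.
import numpy as np
from scipy.stats import chi2

x = np.array([9.2, 10.1, 9.8, 10.4, 9.5, 10.0])     # hypothetical data
alpha, n = 0.05, len(x)
S2 = x.var(ddof=1)                                   # sample variance

lo = (n - 1) * S2 / chi2.ppf(1 - alpha / 2, n - 1)   # divide by upper quantile
hi = (n - 1) * S2 / chi2.ppf(alpha / 2, n - 1)       # divide by lower quantile
print(lo, hi)                                        # the 95% CI for sigma^2
```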

Example

We are going to look at a sample of 20 residual flame times, in seconds, of treated specimens of children's nightwear, where a residual flame time measures how long it takes for something to burst into flames. Don't worry; there were no children in the nightwear during this experiment.

Here's the data:

9.859.939.759.779.679.879.679.949.859.759.839.929.749.999.889.959.959.939.929.89\begin{array}{ccccc} 9.85 & 9.93 &9.75 & 9.77 &9.67 \\ 9.87 & 9.67 &9.94 & 9.85 &9.75 \\ 9.83 & 9.92 &9.74 & 9.99 &9.88 \\ 9.95 & 9.95 &9.93 & 9.92 &9.89 \\ \end{array}

Let's compute a 95%95\% confidence interval for the mean residual flame time. After performing the necessary algebra, we can calculate the following statistics:

Xˉ=9.8475,S=0.0954\bar{X} = 9.8475, \quad S = 0.0954

Remember the equation for computing confidence intervals for μ\mu when σ2\sigma^2 is unknown:

Xˉntα/2,n1S2nμXˉn+tα/2,n1S2n\bar{X}_n - t_{\alpha/2,n-1}\sqrt{\frac{S^2}{n}} \leq \mu \leq \bar{X}_n + t_{\alpha/2,n-1}\sqrt{\frac{S^2}{n}}

Since we are constructing a 95%95\% confidence interval, α=0.05\alpha = 0.05. Additionally, since we have 20 samples, n=20n = 20. Let's plug in:

9.8475t0.025,190.0954220μ9.8475+t0.025,190.09542209.8475 - t_{0.025,19}\sqrt{\frac{0.0954^2}{20}} \leq \mu \leq 9.8475 + t_{0.025, 19}\sqrt{\frac{0.0954^2}{20}}

Now, we need to look up the 1γ=10.025=0.9751-\gamma = 1 - 0.025 = 0.975 quantile for the t(19)t(19) distribution. Looking this value up in a standard t table gives t0.025,19=2.093t_{0.025,19} = 2.093. Thus:

9.8475(2.093)(0.0954)20μ9.8475+(2.093)(0.0954)209.8475 - \frac{(2.093)(0.0954)}{\sqrt{20}} \leq \mu \leq 9.8475 + \frac{(2.093)(0.0954)}{\sqrt{20}}

After the final arithmetic, we see that the confidence interval is:

9.8029μ9.89219.8029 \leq \mu \leq 9.8921

We would say that we are 95% confident that the true mean, μ\mu (of the underlying distribution from which our 2020 observations were sampled), is between 9.80299.8029 and 9.89219.8921.
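
For completeness, here is a sketch that recomputes the interval directly from the data above, using scipy for the tt quantile instead of a printed table; small differences from the hand-computed figures can appear because of rounding:

```python
# Recompute the 95% CI for the mean residual flame time from the raw data.
import numpy as np
from scipy.stats import t

x = np.array([9.85, 9.93, 9.75, 9.77, 9.67,
              9.87, 9.67, 9.94, 9.85, 9.75,
              9.83, 9.92, 9.74, 9.99, 9.88,
              9.95, 9.95, 9.93, 9.92, 9.89])

n = len(x)
xbar, S = x.mean(), x.std(ddof=1)            # sample mean and s.d.
half = t.ppf(0.975, n - 1) * S / np.sqrt(n)  # t_{0.025,19} * sqrt(S^2 / n)
print(xbar - half, xbar + half)              # compare with the interval above
```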
