Simulation

Calculus, Probability, and Statistics Primers, Continued




Notice a tyop typo? Please submit an issue or open a PR.


Conditional Expectation (OPTIONAL)

In this lesson, we are going to continue our exploration of conditional expectation and look at several cool applications. This lesson will likely be the toughest one for the foreseeable future, but don't panic!

Conditional Expectation Recap

Let's revisit the conditional expectation of $Y$ given $X = x$. The definition of this expectation is as follows:

$$E[Y|X = x] = \left\{\begin{matrix} \sum_y y f(y|x) & \text{discrete} \\ \int_{\mathbb{R}} y f(y|x)dy & \text{continuous} \end{matrix}\right.$$

For example, suppose $f(x,y) = 21x^2y / 4$ for $x^2 \leq y \leq 1$. Then, by definition:

$$f(y|x) = \frac{f(x,y)}{f_X(x)}$$

We calculated the marginal pdf, $f_X(x)$, previously, as the integral of $f(x,y)$ over all possible values of $y \in [x^2, 1]$. We can plug in $f_X(x)$ and $f(x, y)$ below:

$$f(y|x) = \frac{\frac{21}{4}x^2y}{\frac{21}{8}(1- x^4)} = \frac{2y}{1 - x^4}, \quad x^2 \leq y \leq 1$$

Given $f(y|x)$, we can now compute $E[Y | X = x]$:

$$E[Y | X = x] = \int_{\mathbb{R}} y \left(\frac{2y}{1 - x^4}\right)dy$$

We adjust the limits of integration to match the limits of $y$:

$$E[Y | X = x] = \int_{x^2}^1 y \left(\frac{2y}{1 - x^4}\right)dy$$

Now, complete the integration:

$$E[Y | X = x] = \int_{x^2}^1 \frac{2y^2}{1 - x^4}dy$$
$$E[Y | X = x] = \frac{2}{1 - x^4} \int_{x^2}^1 y^2dy$$
$$E[Y | X = x] = \frac{2}{1 - x^4} \cdot \frac{y^3}{3}\Big|^1_{x^2}$$
$$E[Y | X = x] = \frac{2}{3(1 - x^4)} y^3\Big|^1_{x^2} = \frac{2(1 - x^6)}{3(1 - x^4)}$$
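As a quick, optional sanity check (not part of the original notes; assumes sympy is available), we can reproduce this conditional expectation symbolically:

```python
import sympy as sp

x = sp.symbols('x', real=True)
y = sp.symbols('y', positive=True)

# Conditional pdf f(y|x) = 2y / (1 - x^4) on x^2 <= y <= 1
f_cond = 2*y / (1 - x**4)

# E[Y | X = x] = integral of y * f(y|x) over [x^2, 1]
cond_mean = sp.integrate(y * f_cond, (y, x**2, 1))

# Difference from the hand-derived answer should simplify to 0
print(sp.simplify(cond_mean - 2*(1 - x**6) / (3*(1 - x**4))))
```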

Double Expectations

We just looked at the expected value of $Y$ given a particular value $X = x$. Now we are going to average the expected value of $Y$ over all values of $X$. In other words, we are going to take the average of all the conditional expected values, which will give us the overall population average for $Y$.

The theorem of double expectations states that the expected value of the expected value of $Y$ given $X$ is the expected value of $Y$. In other words:

$$E[E(Y|X)] = E[Y]$$

Let's look at $E[Y|X]$. We can use the formula that we used to calculate $E[Y|X=x]$ to find $E[Y|X]$, replacing $x$ with $X$. Let's look back at our conditional expectation from the previous slide:

$$E[Y | X = x] = \frac{2(1 - x^6)}{3(1 - x^4)}$$

If we replace $x$ with the random variable $X$, we get the following expression:

$$E[Y | X] = \frac{2(1 - X^6)}{3(1 - X^4)}$$

What does this mean? $E[Y|X]$ is itself a random variable that is a function of the random variable $X$. Let's call this function $h$:

$$h(X) = \frac{2(1 - X^6)}{3(1 - X^4)}$$

We now have to calculate $E[h(X)]$, which we can accomplish using LOTUS, the law of the unconscious statistician:

$$E[h(X)] = \int_{\mathbb{R}} h(x)f_X(x)dx$$

Let's substitute $E(Y|x)$ for $h(x)$:

$$E[E[Y|X]] = \int_{\mathbb{R}} E(Y|x)f_X(x)dx$$

Remember the definition for $E[Y|X = x]$:

$$E[Y|X = x] = \left\{\begin{matrix} \sum_y y f(y|x) & \text{discrete} \\ \int_{\mathbb{R}} y f(y|x)dy & \text{continuous} \end{matrix}\right.$$

Thus:

$$E[E[Y|X]] = \int_{\mathbb{R}} \left(\int_{\mathbb{R}} y f(y|x)dy\right)f_X(x)dx$$

We can rearrange the right-hand side. After switching the order of integration, we can move $y$ outside of the inner integral, since $y$ is a constant when we integrate with respect to $x$:

$$E[E[Y|X]] = \int_{\mathbb{R}} \int_{\mathbb{R}} y f(y|x)f_X(x)dx\, dy = \int_{\mathbb{R}} y \int_{\mathbb{R}} f(y|x)f_X(x)dx\, dy$$

Remember now the definition for the conditional pdf:

$$f(y|x) = \frac{f(x,y)}{f_X(x)}; \quad f(y|x)f_X(x) = f(x, y)$$

We can substitute $f(x,y)$ for $f(y|x)f_X(x)$:

$$E[E[Y|X]] = \int_{\mathbb{R}} y \int_{\mathbb{R}} f(x,y)dx\, dy$$

Let's remember the definition for the marginal pdf of $Y$:

$$f_Y(y) = \int_{\mathbb{R}} f(x,y)dx$$

Let's substitute:

$$E[E[Y|X]] = \int_{\mathbb{R}} y f_Y(y)dy$$

Of course, the expected value of $Y$, $E[Y]$, equals:

$$E[Y] = \int_{\mathbb{R}} y f_Y(y) dy$$

Thus:

$$E[E[Y|X]] = E[Y]$$

Example

Let's apply this theorem using our favorite joint pdf: $f(x,y) = 21x^2y / 4, \; x^2 \leq y \leq 1$. Through previous examples, we know $f_X(x)$, $f_Y(y)$, and $E[Y|x]$:

$$f_X(x) = \frac{21}{8}x^2(1-x^4)$$
$$f_Y(y) = \frac{7}{2}y^{\frac{5}{2}}$$
$$E[Y|x] = \frac{2(1 - x^6)}{3(1 - x^4)}$$

We are going to look at two ways to compute $E[Y]$. First, we can just use the definition of expected value and integrate the product $y f_Y(y)$ over the real line:

$$E[Y] = \int_{\mathbb{R}} y f_Y(y)dy$$
$$E[Y] = \int_0^1 y \cdot \frac{7}{2}y^{\frac{5}{2}} dy$$
$$E[Y] = \int_0^1 \frac{7}{2}y^{\frac{7}{2}} dy$$
$$E[Y] = \frac{7}{2} \int_0^1 y^{\frac{7}{2}} dy$$
$$E[Y] = \frac{7}{2} \cdot \frac{2}{9}y^\frac{9}{2}\Big|_0^1 = \frac{7}{9} y^\frac{9}{2}\Big|_0^1 = \frac{7}{9}$$

Now, let's calculate $E[Y]$ using the double expectation theorem we just learned:

$$E[Y] = E[E(Y|X)] = \int_{\mathbb{R}} E(Y|x) f_X(x)dx$$
$$E[Y] = \int_{-1}^1 \frac{2(1 - x^6)}{3(1 - x^4)} \times \frac{21}{8}x^2(1-x^4) dx$$
$$E[Y] = \frac{42}{24}\int_{-1}^1 \frac{(1 - x^6)}{(1 - x^4)} \times x^2(1-x^4) dx$$
$$E[Y] = \frac{42}{24}\int_{-1}^1 x^2(1 - x^6) dx$$
$$E[Y] = \frac{42}{24}\int_{-1}^1 (x^2 - x^8) dx$$
$$E[Y] = \frac{42}{24} \left(\frac{x^3}{3} - \frac{x^9}{9}\right) \Big|_{-1}^1$$
$$E[Y] = \frac{42}{24} \left(\frac{3x^3 - x^9}{9} \right) \Big|_{-1}^1$$
$$E[Y] = \frac{42}{24} \left(\frac{3 - 1 - (-3+1)}{9} \right) = \frac{42}{24} \cdot \frac{4}{9} = \frac{7}{9}$$
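For readers who want to double-check the arithmetic, here is an optional sympy sketch (not part of the original notes) that computes $E[Y]$ both ways:

```python
import sympy as sp

x = sp.symbols('x', real=True)
y = sp.symbols('y', positive=True)
f_joint = sp.Rational(21, 4) * x**2 * y                       # f(x,y) on x^2 <= y <= 1

# Method 1: E[Y] from the marginal pdf f_Y(y)
f_Y = sp.integrate(f_joint, (x, -sp.sqrt(y), sp.sqrt(y)))     # 7/2 * y^(5/2)
EY_direct = sp.integrate(y * f_Y, (y, 0, 1))

# Method 2: double expectation, E[E(Y|X)]
f_X = sp.integrate(f_joint, (y, x**2, 1))                     # 21/8 * x^2 * (1 - x^4)
EY_given_x = sp.integrate(y * f_joint / f_X, (y, x**2, 1))    # 2(1 - x^6) / (3(1 - x^4))
EY_double = sp.integrate(sp.simplify(EY_given_x * f_X), (x, -1, 1))

print(EY_direct, sp.simplify(EY_double))                      # both 7/9
```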

Mean of the Geometric Distribution

In this application, we are going to see how we can use double expectation to calculate the mean of a geometric distribution.

Let $Y$ equal the number of coin flips until a head, $H$, appears, where $P(H) = p$. Thus, $Y$ is distributed as a geometric random variable parameterized by $p$: $Y \sim \text{Geom}(p)$. We know that the pmf of $Y$ is $f_Y(y) = P(Y = y) = (1-p)^{y-1}p, \; y = 1,2,...$. In other words, $P(Y = y)$ is the product of the probability of $y-1$ failures and the probability of one success.

Let's calculate the expected value of $Y$ using the summation equation we've used previously (take the result on faith):

$$E[Y] = \sum_y y f_Y(y) = \sum_{y=1}^\infty y(1-p)^{y-1}p = \frac{1}{p}$$

Now we are going to use double expectation and a standard one-step conditioning argument to compute $E[Y]$. First, let's define $X = 1$ if the first flip is $H$ and $X = 0$ otherwise. Let's pretend that we have knowledge of the first flip. We don't really have this knowledge, but we do know that the first flip is either heads or tails: $P(X = 1) = p, \; P(X = 0) = 1 - p$.

Let's remember the double expectation formula:

$$E[Y] = E[E(Y|X)] = \sum_x E(Y|x)f_X(x)$$

What are the $x$-values? $X$ can only equal $0$ or $1$, so:

$$E[Y] = E(Y|X = 0)P(X = 0) + E(Y|X=1)P(X=1)$$

Now, if $X = 0$, the first flip was tails, and I have to start counting all over again. The expected number of additional flips I have to make before I see heads is $E[Y]$. However, I have already flipped once, and I flipped tails: that's what $X = 0$ means. So, the expected number of flips I need, given that I already flipped tails, is $1 + E[Y]$: $E[Y|X=0] = 1 + E[Y]$. What is $P(X = 0)$? It's just $1 - p$. Thus:

$$E[Y|X = 0]P(X = 0) = (1 + E[Y])(1 - p)$$

Now, if $X = 1$, the first flip was heads. I won! Given that $X = 1$, the expected value of $Y$ is one. If I know that I flipped heads on the first try, the expected number of flips until I see heads is that one flip: $E[Y|X=1] = 1$. What is $P(X = 1)$? It's just $p$. Thus:

$$E[Y|X = 1]P(X = 1) = (1)(p) = p$$

Let's solve for $E[Y]$:

$$E[Y] = (1 + E[Y])(1 - p) + p$$
$$E[Y] = 1 + E[Y] - p - pE[Y] + p$$
$$E[Y] = 1 + E[Y] - pE[Y]$$
$$pE[Y] = 1; \quad E[Y] = \frac{1}{p}$$
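Here is a tiny simulation (not from the notes; the helper name is made up) that agrees with $E[Y] = 1/p$:

```python
import random

rng = random.Random(0)

def flips_until_heads(p):
    """Simulate one Geom(p) outcome: the number of flips until the first head."""
    flips = 1
    while rng.random() >= p:   # a draw >= p counts as tails
        flips += 1
    return flips

p = 0.3
n = 100_000
sample_mean = sum(flips_until_heads(p) for _ in range(n)) / n
print(sample_mean, 1 / p)      # the sample mean should be close to 1/p ~ 3.33
```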

Computing Probabilities by Conditioning

Let $A$ be some event. We define the random variable $Y = 1$ if $A$ occurs, and $Y = 0$ otherwise. We refer to $Y$ as an indicator function of $A$; that is, the value of $Y$ indicates the occurrence of $A$. The expected value of $Y$ is given by:

$$E[Y] = \sum_y y f_Y(y)$$

Let's enumerate the $y$-values:

$$E[Y] = 0(P(Y = 0)) + 1(P(Y = 1)) = P(Y = 1)$$

What is $P(Y = 1)$? Well, $Y = 1$ when $A$ occurs, so $P(Y = 1) = P(A) = E[Y]$. Indeed, the expected value of an indicator function is the probability of the corresponding event.

Similarly, for any random variable, $X$, we have:

$$E[Y | X = x] = \sum_y y f_Y(y|x)$$

If we enumerate the $y$-values, we have:

$$E[Y | X = x] = 0 \cdot P(Y = 0|X= x) + 1 \cdot P(Y = 1|X = x) = P(Y = 1|X = x)$$

Since $Y = 1$ exactly when $A$ occurs, $P(Y = 1 | X = x) = P(A | X = x)$, and therefore:

$$E[Y | X = x] = P(A|X = x)$$

Let's look at an implication of the above result. By definition:

$$P[A] = E[Y] = E[E(Y | X)]$$

Using LOTUS:

$$P[A] = \int_{\mathbb{R}} E[Y|X=x]dF_X(x)$$

Since we saw that $E[Y|X=x] = P(A|X=x)$, then:

$$P[A] = \int_{\mathbb{R}} P(A|X=x)dF_X(x)$$

Theorem

The result above implies that, if $X$ and $Y$ are independent, continuous random variables, then:

$$P(Y < X) = \int_{\mathbb{R}} P(Y < x)f_X(x)dx$$

To prove this, let $A = \{Y < X\}$. Then:

$$P[A] = \int_{\mathbb{R}} P(A|X=x)dF_X(x)$$

Substitute $A = \{Y < X\}$:

$$P[A] = \int_{\mathbb{R}} P(Y < X|X=x)dF_X(x)$$

What's $P(Y < X|X=x)$? In other words, for a given $X = x$, what's the probability that $Y < X$? Since $X$ and $Y$ are independent, that's a long way of saying $P(Y < x)$:

$$P[A] = \int_{\mathbb{R}} P(Y < x)dF_X(x)$$
$$P[A] = P[Y < X] = \int_{\mathbb{R}} P(Y < x)f_X(x)dx, \quad \text{since } dF_X(x) = f_X(x)dx$$

Example

Suppose we have two independent random variables, $X \sim \text{Exp}(\mu)$ and $Y \sim \text{Exp}(\lambda)$. Then:

$$P(Y < X) = \int_{\mathbb{R}} P(Y < x)f_X(x)dx$$

Note that $P(Y < x)$ is the cdf of $Y$ at $x$: $F_Y(x)$. Thus:

$$P(Y < X) = \int_{\mathbb{R}} F_Y(x)f_X(x)dx$$

Since $X$ and $Y$ are both exponentially distributed, we know that they have the following pdf and cdf, by definition:

$$f(x; \lambda) = \lambda e^{-\lambda x}, \quad x \geq 0$$
$$F(x; \lambda) = 1 - e^{-\lambda x}, \quad x \geq 0$$

Let's substitute these values in, adjusting the limits of integration appropriately:

$$P(Y < X) = \int_0^\infty \left(1 - e^{-\lambda x}\right)\mu e^{-\mu x}\,dx$$

Let's rearrange:

$$P(Y < X) = \mu \int_0^\infty \left(e^{-\mu x} - e^{-\lambda x - \mu x}\right) dx$$
$$P(Y < X) = \mu \left[\int_0^\infty e^{-\mu x} dx - \int_0^\infty e^{-\lambda x - \mu x} dx\right]$$

Let $u_1 = -\mu x$, so $du_1 = -\mu\, dx$, and let $u_2 = -\lambda x - \mu x$, so $du_2 = -(\lambda + \mu)dx$. Thus (keeping the limits written in terms of $x$ and substituting back before evaluating):

$$P(Y < X) = \mu \left[-\int_0^\infty \frac{e^{u_1}}{\mu} du_1 + \int_0^\infty \frac{e^{u_2}}{\lambda + \mu} du_2\right]$$

Now we can integrate:

$$P(Y < X) = \mu \left[\int_0^\infty \frac{e^{u_2}}{\lambda + \mu} du_2 - \int_0^\infty \frac{e^{u_1} }{\mu}du_1 \right]$$
$$P(Y < X) = \mu \left[\frac{e^{u_2}}{\lambda + \mu} - \frac{e^{u_1}}{\mu} \right]_0^\infty$$
$$P(Y < X) = \mu \left[\frac{e^{-\lambda x - \mu x}}{\lambda + \mu} - \frac{e^{-\mu x}}{\mu} \right]_0^\infty$$
$$P(Y < X) = \mu \left[0 - \frac{1}{\lambda + \mu} + \frac{1}{\mu} \right]$$
$$P(Y < X) = \mu \left[\frac{1}{\mu} - \frac{1}{\lambda + \mu} \right]$$
$$P(Y < X) = \frac{\mu}{\mu} - \frac{\mu}{\lambda + \mu}$$
$$P(Y < X) = \frac{\lambda + \mu}{\lambda + \mu} - \frac{\mu}{\lambda + \mu} = \frac{\lambda}{\lambda + \mu}$$

As it turns out, this result makes sense because $X$ and $Y$ correspond to arrival times in Poisson processes, and $\mu$ and $\lambda$ are the arrival rates. For example, suppose that $Y$ corresponds to the arrival time of the first woman to a store, and $X$ corresponds to the arrival time of the first man. If women are coming in at a rate of three per hour ($\lambda = 3$) and men are coming in at a rate of nine per hour ($\mu = 9$), then the probability that a woman arrives before a man is $\lambda / (\lambda + \mu) = 3/12 = 1/4$, and the probability that a man arrives first is $3/4$.
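As an optional check (not part of the original notes), a short simulation of the two exponential arrival times recovers $\lambda / (\lambda + \mu)$:

```python
import random

rng = random.Random(1)
lam, mu = 3.0, 9.0          # Y ~ Exp(lam), X ~ Exp(mu); illustrative rates from the example
n = 200_000

# Estimate P(Y < X) and compare it to lambda / (lambda + mu)
wins = sum(rng.expovariate(lam) < rng.expovariate(mu) for _ in range(n))
print(wins / n, lam / (lam + mu))   # both should be close to 0.25
```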

Variance Decomposition

Just as we can use double expectation for the expected value of $Y$, we can express the variance of $Y$, $\text{Var}(Y)$, in a similar fashion, which we refer to as variance decomposition:

$$\text{Var}(Y) = E[\text{Var}(Y|X)] + \text{Var}[E(Y|X)]$$

Proof

Let's start with the first term: $E[\text{Var}(Y|X)]$. Remember the definition of variance, as the second central moment:

$$\text{Var}(X) = E[X^2] - (E[X])^2$$

Thus, we can express $E[\text{Var}(Y|X)]$ as:

$$E[\text{Var}(Y|X)] = E[E[Y^2 | X] - (E[Y|X])^2]$$

Note that, since expectation is linear:

$$E[\text{Var}(Y|X)] = E[E[Y^2 | X]] - E[(E[Y|X])^2]$$

Notice the first expression on the right-hand side. That's a double expectation, and we know how to simplify that:

$$E[\text{Var}(Y|X)] = E[Y^2] - E[(E[Y|X])^2] \quad (1)$$

Now let's look at the second term in the variance decomposition: $\text{Var}[E(Y|X)]$. Considering again the definition for variance above, we can transform this term:

$$\text{Var}[E(Y|X)] = E[(E[Y | X])^2] - (E[E[Y|X]])^2$$

In this equation, we again see a double expectation, quantity squared. So:

$$\text{Var}[E(Y|X)] = E[(E[Y| X])^2] - (E[Y])^2 \quad (2)$$

Remember the equation for variance decomposition:

$$\text{Var}(Y) = E[\text{Var}(Y|X)] + \text{Var}[E(Y|X)]$$

Let's plug in $(1)$ and $(2)$ for the first and second term, respectively:

$$\text{Var}(Y) = E[Y^2] - E[(E[Y|X])^2] + E[(E[Y | X])^2] - (E[Y])^2$$

Notice the cancellation of the two scary inner terms to reveal the definition for variance:

$$\text{Var}(Y) = E[Y^2] - (E[Y])^2 = \text{Var}(Y)$$
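If you'd like to see the decomposition numerically, here is a small sketch (not from the notes) using a made-up hierarchical example where both pieces are easy to compute by hand:

```python
import random, statistics

rng = random.Random(2)
n = 200_000

# X is uniform on {1, 2, 3}; given X = x, Y ~ Normal(mean=x, sd=x).
# Then E[Var(Y|X)] = E[X^2] = 14/3 and Var(E[Y|X]) = Var(X) = 2/3, so Var(Y) = 16/3.
xs = [rng.choice([1, 2, 3]) for _ in range(n)]
ys = [rng.gauss(x, x) for x in xs]

print(statistics.pvariance(ys), 14/3 + 2/3)   # both roughly 5.33
```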

Covariance and Correlation

In this lesson, we are going to talk about independence, covariance, correlation, and some related results. Correlation shows up all over the place in simulation, from inputs to outputs to everywhere in between.

LOTUS in 2D

Suppose that $h(X,Y)$ is some function of two random variables, $X$ and $Y$. Then, via LOTUS, we know how to calculate the expected value, $E[h(X,Y)]$:

$$E[h(X,Y)] = \left\{\begin{matrix} \sum_x \sum_y h(x,y)f(x,y) & \text{if } (X,Y) \text{ is discrete} \\ \int_{\mathbb{R}} \int_{\mathbb{R}} h(x,y)f(x,y)dx dy & \text{if } (X,Y) \text{ is continuous} \end{matrix}\right.$$

Expected Value, Variance of Sum

Whether or not $X$ and $Y$ are independent, the sum of the expected values equals the expected value of the sum:

$$E[X+Y] = E[X] + E[Y]$$

If $X$ and $Y$ are independent, then the sum of the variances equals the variance of the sum:

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$

Note that we need the equations for LOTUS in two dimensions to prove both of these theorems.

Aside: I tried to prove these theorems. It went terribly! Check out the proper proofs here.

Random Sample

Let's suppose we have a set of $n$ random variables: $X_1,...,X_n$. This set is said to form a random sample from the pmf/pdf $f(x)$ if all the variables are (i) independent and (ii) each $X_i$ has the same pdf/pmf $f(x)$.

We can use the following notation to refer to such a random sample:

$$X_1,...,X_n \overset{\text{iid}}{\sim} f(x)$$

Note that "iid" means "independent and identically distributed", which is what (i) and (ii) mean, respectively, in our definition above.

Theorem

Given a random sample, $X_1,...,X_n \overset{\text{iid}}{\sim} f(x)$, the sample mean, $\bar{X_n}$, is defined as:

$$\bar{X_n} \equiv \sum_{i =1}^n \frac{X_i}{n}$$

Given the sample mean, the expected value of the sample mean is the expected value of any of the individual variables, and the variance of the sample mean is the variance of any of the individual variables divided by $n$:

$$E[\bar{X_n}] = E[X_i]; \quad \text{Var}(\bar{X_n}) = \text{Var}(X_i) / n$$

We can observe that as $n$ increases, $E[\bar{X_n}]$ is unaffected, but $\text{Var}(\bar{X_n})$ decreases.
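A quick simulation (not part of the notes) illustrates the $\text{Var}(X_i)/n$ behavior for $\text{Unif}(0,1)$ samples, where $\text{Var}(X_i) = 1/12$:

```python
import random, statistics

rng = random.Random(3)

def variance_of_sample_mean(n, reps=20_000):
    """Empirical variance of the mean of n iid Unif(0,1) draws."""
    means = [statistics.fmean(rng.random() for _ in range(n)) for _ in range(reps)]
    return statistics.pvariance(means)

for n in (1, 4, 16):
    print(n, variance_of_sample_mean(n), (1 / 12) / n)   # empirical vs. theoretical
```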

Covariance

Covariance is one of the most fundamental measures of non-independence between two random variables. The covariance between $X$ and $Y$, $\text{Cov}(X, Y)$, is defined as:

$$\text{Cov}(X,Y) \equiv E[(X-E[X])(Y - E[Y])]$$

The right-hand side of this equation looks daunting, so let's see if we can simplify it. We can first expand the product:

$$E[(X-E[X])(Y - E[Y])] = E[XY - XE[Y] - YE[X] + E[Y]E[X]]$$

Since expectation is linear, we can rewrite the right-hand side as a difference of expected values:

$$E[(X-E[X])(Y - E[Y])] = E[XY] - E[XE[Y]] - E[YE[X]] + E[E[Y]E[X]]$$

Note that both $E[X]$ and $E[Y]$ are just numbers: the expected values of the corresponding random variables. As a result, we can apply two principles here: $E[aX] = aE[X]$ and $E[a] = a$. Consider the following rearrangement:

$$E[(X-E[X])(Y - E[Y])] = E[XY] - E[Y]E[X] - E[X]E[Y] + E[Y]E[X]$$

The last three terms are each equal to $\pm E[X]E[Y]$, and they sum to $-E[Y]E[X]$. Thus:

$$\text{Cov}(X,Y) \equiv E[(X-E[X])(Y - E[Y])] = E[XY] - E[Y]E[X]$$

This equation is much easier to work with; namely, $h(X,Y) = XY$ is a much simpler function than $h(X,Y) = (X-E[X])(Y - E[Y])$ when it comes time to apply LOTUS.

Let's understand what happens when we take the covariance of $X$ with itself:

$$\text{Cov}(X,X) = E[X \cdot X] - E[X]E[X] = E[X^2] - (E[X])^2 = \text{Var}(X)$$

Theorem

If $X$ and $Y$ are independent random variables, then $\text{Cov}(X, Y) = 0$. On the other hand, a covariance of $0$ does not mean that $X$ and $Y$ are independent.

For example, consider two random variables, $X \sim \text{Unif}(-1,1)$ and $Y = X^2$. Since $Y$ is a function of $X$, the two random variables are dependent: if you know $X$, you know $Y$. However, take a look at the covariance:

$$\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2]$$

What is $E[X]$? Well, we can integrate the pdf from $-1$ to $1$, or we can remember that the expected value of a uniform random variable is the average of the bounds of the distribution. That's a long way of saying that $E[X] = (-1 + 1) / 2 = 0$.

Now, what is $E[X^3]$? We can apply LOTUS:

$$E[X^3] = \int_{-1}^1 x^3f(x)dx$$

What is the pdf of a uniform random variable? By definition, it's one over the difference of the bounds, so here $f(x) = 1/2$:

$$E[X^3] = \frac{1}{1 - (-1)}\int_{-1}^1 x^3 dx$$

Let's integrate and evaluate:

$$E[X^3] = \frac{1}{2} \cdot \frac{x^4}{4}\Big|_{-1}^1 = \frac{1^4}{8} - \frac{(-1)^4}{8} = 0$$

Thus:

$$\text{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0$$

Just because the covariance between $X$ and $Y$ is $0$ does not mean that they are independent!
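A short simulation (not from the notes) makes the same point empirically:

```python
import random

rng = random.Random(4)
n = 500_000

xs = [rng.uniform(-1, 1) for _ in range(n)]
ys = [x**2 for x in xs]                     # Y = X^2 is completely determined by X

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
print(cov_xy)                               # close to 0 even though X and Y are dependent
```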

More Theorems

Suppose that we have two random variables, $X$ and $Y$, as well as two constants, $a$ and $b$. We have the following theorem:

$$\text{Cov}(aX, bY) = ab\,\text{Cov}(X,Y)$$

Whether or not $X$ and $Y$ are independent,

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$
$$\text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y) - 2\text{Cov}(X, Y)$$

Note that we looked at a theorem previously which gave an equation for the variance of $X + Y$ when both variables are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. That equation was a special case of the theorem above, where $\text{Cov}(X,Y) = 0$, as is the case between two independent random variables.

Correlation

The correlation between $X$ and $Y$, $\rho$, is equal to:

$$\rho \equiv \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}$$

Note that correlation is standardized covariance. As a result, for any $X$ and $Y$, $-1 \leq \rho \leq 1$.

If two variables are highly positively correlated, then $\rho$ will be close to $1$. If two variables are highly negatively correlated, then $\rho$ will be close to $-1$. Two variables with low correlation will have a $\rho$ close to $0$.

Example

Consider the following joint pmf:

$$\begin{array}{c|ccc|c} f(x,y) & X = 2 & X = 3 & X = 4 & f_Y(y) \\ \hline Y = 40 & 0.00 & 0.20 & 0.10 & 0.3 \\ Y = 50 & 0.15 & 0.10 & 0.05 & 0.3 \\ Y = 60 & 0.30 & 0.00 & 0.10 & 0.4 \\ \hline f_X(x) & 0.45 & 0.30 & 0.25 & 1 \\ \end{array}$$

For this pmf, $X$ can take values in $\{2, 3, 4\}$ and $Y$ can take values in $\{40, 50, 60\}$. Note the marginal pmfs along the table's right and bottom, and remember that all pmfs sum to one when calculated over all appropriate values.

What is the expected value of $X$? Let's use $f_X(x)$:

$$E[X] = 2(0.45) + 3(0.3) + 4(0.25) = 2.8$$

Now let's calculate the variance:

$$\text{Var}(X) = E[X^2] - (E[X])^2$$
$$\text{Var}(X) = 4(0.45) + 9(0.3) + 16(0.25) - (2.8)^2 = 0.66$$

What is the expected value of $Y$? Let's use $f_Y(y)$:

$$E[Y] = 40(0.3) + 50(0.3) + 60(0.4) = 51$$

Now let's calculate the variance:

$$\text{Var}(Y) = E[Y^2] - (E[Y])^2$$
$$\text{Var}(Y) = 1600(0.3) + 2500(0.3) + 3600(0.4) - (51)^2 = 69$$

If we want to calculate the covariance of $X$ and $Y$, we need to know $E[XY]$, which we can calculate using two-dimensional LOTUS:

$$E[XY] = \sum_x \sum_y xy f(x,y)$$
$$E[XY] = (2 \cdot 40 \cdot 0.00) + (2 \cdot 50 \cdot 0.15) + ... + (4 \cdot 60 \cdot 0.1) = 140$$

With $E[XY]$ in hand, we can calculate the covariance of $X$ and $Y$:

$$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 140 - (2.8 \cdot 51) = -2.8$$

Finally, we can calculate the correlation:

$$\rho = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}$$
$$\rho = \frac{-2.8}{\sqrt{0.66(69)}} \approx -0.415$$
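The whole table calculation is easy to mechanize; here is an optional numpy version (not part of the notes) that reproduces every number above:

```python
import numpy as np

x_vals = np.array([2, 3, 4])
y_vals = np.array([40, 50, 60])
# Rows follow Y = 40, 50, 60; columns follow X = 2, 3, 4 (same layout as the table)
pmf = np.array([[0.00, 0.20, 0.10],
                [0.15, 0.10, 0.05],
                [0.30, 0.00, 0.10]])

f_x = pmf.sum(axis=0)                        # marginal pmf of X: [0.45, 0.30, 0.25]
f_y = pmf.sum(axis=1)                        # marginal pmf of Y: [0.3, 0.3, 0.4]
EX, EY = x_vals @ f_x, y_vals @ f_y
VarX = (x_vals**2) @ f_x - EX**2
VarY = (y_vals**2) @ f_y - EY**2
EXY = (np.outer(y_vals, x_vals) * pmf).sum()
cov = EXY - EX * EY
rho = cov / np.sqrt(VarX * VarY)

print(EX, VarX, EY, VarY, EXY, cov, rho)     # 2.8, 0.66, 51, 69, 140, -2.8, ~ -0.415
```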

Portfolio Example

Let's look at two different assets, $S_1$ and $S_2$, that we hold in our portfolio. The expected yearly returns of the assets are $E[S_1] = \mu_1$ and $E[S_2] = \mu_2$, and the variances are $\text{Var}(S_1) = \sigma_1^2$ and $\text{Var}(S_2) = \sigma_2^2$. The covariance between the assets is $\sigma_{12}$.

A portfolio is just a weighted combination of assets, and we can define our portfolio, $P$, as:

$$P = wS_1 + (1 - w)S_2, \quad w \in [0,1]$$

The portfolio's expected value is the sum of the expected values of the assets times their corresponding weights:

$$E[P] = E[wS_1 + (1 - w)S_2]$$
$$E[P] = E[wS_1] + E[(1 - w)S_2]$$
$$E[P] = wE[S_1] + (1 - w)E[S_2]$$
$$E[P] = w\mu_1 + (1-w)\mu_2$$

Let's calculate the variance of the portfolio:

$$\text{Var}(P) = \text{Var}(wS_1 + (1-w)S_2)$$

Remember how we express $\text{Var}(X + Y)$:

$$\text{Var}(P) = \text{Var}(wS_1) + \text{Var}((1-w)S_2) + 2\text{Cov}(wS_1, (1-w)S_2)$$

Remember that $\text{Var}(aX) = a^2\text{Var}(X)$ and $\text{Cov}(aX, bY) = ab\,\text{Cov}(X,Y)$. Thus:

$$\text{Var}(P) = w^2\text{Var}(S_1) + (1-w)^2\text{Var}(S_2) + 2w(1-w)\text{Cov}(S_1, S_2)$$

Finally, let's substitute in the appropriate variables:

$$\text{Var}(P) = w^2\sigma^2_1 + (1-w)^2\sigma^2_2 + 2w(1-w)\sigma_{12}$$

How might we optimize this portfolio? One thing we might want to optimize for is minimal variance: many people want their portfolios to have as little volatility as possible.

Let's recap. Given a function $f(x)$, how do we find the $x$ that minimizes $f(x)$? We can take the derivative, $f'(x)$, set it to $0$, and then solve for $x$. Let's apply this logic to $\text{Var}(P)$. First, we take the derivative with respect to $w$:

$$\frac{d}{dw}\text{Var}(P) = 2w\sigma^2_1 - 2(1-w)\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}$$
$$\frac{d}{dw}\text{Var}(P) = 2w\sigma^2_1 - 2\sigma^2_2 + 2w\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}$$

Then, we set the derivative equal to $0$ and solve for $w$:

$$0 = 2w\sigma^2_1 - 2\sigma^2_2 + 2w\sigma^2_2 + 2\sigma_{12} - 4w\sigma_{12}$$
$$0 = w\sigma^2_1 - \sigma^2_2 + w\sigma^2_2 + \sigma_{12} - 2w\sigma_{12}$$
$$\sigma^2_2 - \sigma_{12} = w\sigma^2_1 + w\sigma^2_2 - 2w\sigma_{12}$$
$$\sigma^2_2 - \sigma_{12} = w(\sigma^2_1 + \sigma^2_2 - 2\sigma_{12})$$
$$w = \frac{\sigma^2_2 - \sigma_{12}}{\sigma^2_1 + \sigma^2_2 - 2\sigma_{12}}$$

Example

Suppose $E[S_1] = 0.2$, $E[S_2] = 0.1$, $\text{Var}(S_1) = 0.2$, $\text{Var}(S_2) = 0.4$, and $\text{Cov}(S_1, S_2) = -0.1$.

What value of $w$ maximizes the expected return of this portfolio? We don't even have to do any math: just allocate 100% of the portfolio to the asset with the higher expected return, $S_1$. Since we define our portfolio as $wS_1 + (1 - w)S_2$, the correct value for $w$ is $1$.

What value of $w$ minimizes the variance? Let's plug and chug:

$$w = \frac{\sigma^2_2 - \sigma_{12}}{\sigma^2_1 + \sigma^2_2 - 2\sigma_{12}}$$
$$w = \frac{0.4 + 0.1}{0.2 + 0.4 + 0.2} = 0.5 / 0.8 = 0.625$$

To minimize variance, we should hold a portfolio consisting of $5/8$ $S_1$ and $3/8$ $S_2$.
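A tiny helper (not from the notes; the function names are made up) evaluates the minimum-variance weight and compares the resulting variance to an all-in-$S_1$ allocation:

```python
def min_variance_weight(var1, var2, cov12):
    """w* = (sigma_2^2 - sigma_12) / (sigma_1^2 + sigma_2^2 - 2*sigma_12)."""
    return (var2 - cov12) / (var1 + var2 - 2 * cov12)

def portfolio_variance(w, var1, var2, cov12):
    return w**2 * var1 + (1 - w)**2 * var2 + 2 * w * (1 - w) * cov12

var1, var2, cov12 = 0.2, 0.4, -0.1
w_star = min_variance_weight(var1, var2, cov12)
print(w_star)                                         # 0.625
print(portfolio_variance(w_star, var1, var2, cov12))  # variance at the optimum
print(portfolio_variance(1.0, var1, var2, cov12))     # all-in on S1, for comparison (0.2)
```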

There are tradeoffs in any optimization. For example, optimizing for maximal expected return may introduce high levels of volatility into the portfolio. Conversely, optimizing for minimal variance may result in paltry returns.

Probability Distributions

In this lesson, we are going to review several popular discrete and continuous distributions.

Bernoulli (Discrete)

Suppose we have a random variable, $X \sim \text{Bernoulli}(p)$. $X$ has the following pmf:

$$f(x) = \left\{\begin{matrix} p & \text{if } x = 1 \\ 1 - p \ (= q) & \text{if } x = 0 \end{matrix}\right.$$

Additionally, $X$ has the following properties:

$$E[X] = p$$
$$\text{Var}(X) = pq$$
$$M_X(t) = pe^t + q$$

Binomial (Discrete)

The Bernoulli distribution generalizes to the binomial distribution. Suppose we have $n$ iid Bernoulli random variables: $X_1,X_2,...,X_n \overset{\text{iid}}{\sim} \text{Bern}(p)$. Each $X_i$ takes on the value $1$ with probability $p$ and $0$ with probability $1-p$. If we take the sum of the successes, we have the following random variable, $Y$:

$$Y = \sum_{i = 1}^n X_i \sim \text{Bin}(n,p)$$

$Y$ has the following pmf:

$$f(y) = \binom{n}{y}p^yq^{n-y}, \quad y = 0, 1,...,n.$$

Notice the binomial coefficient in this equation. We read this as "$n$ choose $y$", which is defined as:

$$\binom{n}{y} = \frac{n!}{y!(n-y)!}$$

What's going on here? First, what is the probability of $y$ successes? Well, completely, it's the probability of $y$ successes and $n-y$ failures: $p^yq^{n-y}$. Of course, the outcome of $y$ consecutive successes followed by $n-y$ consecutive failures is just one particular arrangement of many. How many? $n$ choose $y$. This is what the binomial coefficient expresses.

Additionally, $Y$ has the following properties:

$$E[Y] = np$$
$$\text{Var}(Y) = npq$$
$$M_Y(t) = (pe^t + q)^n$$

Note that the variance and the expected value are equal to $n$ times the variance and the expected value of the Bernoulli random variable. This relationship makes sense: a binomial random variable is the sum of $n$ Bernoulli's. The moment-generating function looks a little bit different. As it turns out, we multiply the moment-generating functions when we sum independent random variables.
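The "sum of Bernoullis" view is easy to verify by simulation; this sketch (not part of the notes) checks the mean and variance against $np$ and $npq$:

```python
import random, statistics

rng = random.Random(5)
n, p = 20, 0.3
reps = 100_000

# Build each Bin(n, p) draw as the sum of n Bernoulli(p) draws
samples = [sum(rng.random() < p for _ in range(n)) for _ in range(reps)]

print(statistics.fmean(samples), n * p)                  # mean ~ np = 6
print(statistics.pvariance(samples), n * p * (1 - p))    # variance ~ npq = 4.2
```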

Geometric (Discrete)

Suppose we have a random variable, $X \sim \text{Geometric}(p)$. A geometric random variable corresponds to the number of $\text{Bern}(p)$ trials until a success occurs. For example, three failures followed by a success ("FFFS") implies that $X = 4$. A geometric random variable has the following pmf:

$$f(x) = q^{x-1}p, \quad x = 1,2,...$$

We can see that this equation directly corresponds to the probability of $x - 1$ failures, each with probability $q$, followed by one success, with probability $p$.

Additionally, $X$ has the following properties:

$$E[X] = \frac{1}{p}$$
$$\text{Var}(X) = \frac{q}{p^2}$$
$$M_X(t) = \frac{pe^t}{1-qe^t}$$

Negative Binomial (Discrete)

The geometric distribution generalizes to the negative binomial distribution. Suppose that we are interested in the number of trials it takes to see $r$ successes. We can add $r$ iid $\text{Geom}(p)$ random variables to get the random variable $Y \sim \text{NegBin}(r, p)$. For example, if $r = 3$, then the run "FFFSSFS" implies that $Y = 7$. $Y$ has the following pmf:

$$f(y) = \binom{y-1}{r-1}q^{y-r}p^{r}, \quad y = r, r + 1,...$$

Additionally, $Y$ has the following properties:

$$E[Y] = \frac{r}{p}$$
$$\text{Var}(Y) = \frac{qr}{p^2}$$

Note that the variance and the expected value are equal to $r$ times the variance and the expected value of the geometric random variable. This relationship makes sense: a negative binomial random variable is the sum of $r$ geometric random variables.

Poisson (Discrete)

A counting process, $N(t)$, keeps track of the number of "arrivals" observed between time $0$ and time $t$. For example, if $7$ people show up to a store by time $t=3$, then $N(3) = 7$. A Poisson process is a counting process that satisfies the following criteria.

  1. Arrivals must occur one-at-a-time at a rate, $\lambda$. For example, $\lambda = 4 / \text{hr}$ means that, on average, arrivals occur every fifteen minutes, yet no two arrivals coincide.
  2. Disjoint time increments are independent. Suppose we are looking at arrivals on the intervals 12 am - 2 am and 5 am - 10 am. Independent increments means that the arrivals in the first interval don't impact arrivals in the second.
  3. Increments are stationary; in other words, the distribution of the number of arrivals in the interval $[s, s + t]$ depends only on the interval's length, $t$. It does not depend on where the interval starts, $s$.

A random variable $X \sim \text{Pois}(\lambda)$ describes the number of arrivals that a Poisson process experiences in one time unit, i.e., $N(1)$. $X$ has the following pmf:

$$f(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0,1,...$$

Additionally, $X$ has the following properties:

$$E[X] = \text{Var}(X) = \lambda$$
$$M_X(t) = e^{\lambda(e^t - 1)}$$
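To connect the process view with the $\text{Pois}(\lambda)$ pmf, here is an optional sketch (not from the notes) that builds $N(1)$ from $\text{Exp}(\lambda)$ interarrival times and checks that its mean and variance both land near $\lambda$:

```python
import random, statistics

rng = random.Random(6)
lam = 4.0            # arrival rate per time unit
reps = 50_000

def arrivals_in_one_unit(rate):
    """Count arrivals in [0, 1] by accumulating Exp(rate) interarrival times."""
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(rate)
        if t > 1.0:
            return count
        count += 1

counts = [arrivals_in_one_unit(lam) for _ in range(reps)]
print(statistics.fmean(counts), statistics.pvariance(counts))   # both close to lambda = 4
```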

Uniform (Continuous)

A uniform random variable, $X \sim \text{Uniform}(a,b)$, has the following pdf:

$$f(x) = \frac{1}{b - a}, \quad a \leq x \leq b$$

Additionally, $X$ has the following properties:

$$E[X] = \frac{a + b}{2}$$
$$\text{Var}(X) = \frac{(b-a)^2}{12}$$
$$M_X(t) = \frac{e^{tb} - e^{ta}}{tb - ta}$$

Exponential (Continuous)

A continuous, exponential random variable $X \sim \text{Exponential}(\lambda)$ has the following pdf:

$$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

Additionally, $X$ has the following properties:

$$E[X] = \frac{1}{\lambda}$$
$$\text{Var}(X) = \frac{1}{\lambda^2}$$
$$M_X(t) = \frac{\lambda}{\lambda - t}, \quad t < \lambda$$

The exponential distribution also has a memoryless property, which means that for $s, t > 0$, $P(X > s + t | X > s) = P(X > t)$. For example, if we have a light bulb, and we know that it has lived for $s$ time units, the conditional probability that it will live for $s + t$ time units in total is the same as the unconditional probability that a brand-new bulb lives for $t$ time units.
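A short simulation (not part of the notes) illustrates the memoryless property for an arbitrary rate and pair $s, t$:

```python
import random, math

rng = random.Random(7)
lam, s, t = 0.5, 2.0, 3.0
n = 500_000

lifetimes = [rng.expovariate(lam) for _ in range(n)]
survived_s = [x for x in lifetimes if x > s]

p_conditional = sum(x > s + t for x in survived_s) / len(survived_s)  # P(X > s + t | X > s)
p_fresh = sum(x > t for x in lifetimes) / n                           # P(X > t)
print(p_conditional, p_fresh, math.exp(-lam * t))                     # all close to ~0.223
```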