Simulation

Random Variate Generation, Continued

63 minute read



Notice a tyop typo? Please submit an issue or open a PR.

Random Variate Generation, Continued

Composition (OPTIONAL)

In this lesson, we will learn about the composition method. This technique only works for random variables that are themselves combinations or mixtures of random variables, but it works well when we can use it.

Composition

We use composition when working with a random variable that comes from a combination of two random variables and can be decomposed into its constituent parts.

Let's consider a concrete example. An airplane might be late leaving the gate for two reasons: air traffic delays and maintenance delays. These two delays compose the overall delay time. Most of the time, the total delay is due to a small air traffic delay; less often, the total delay is due to a larger maintenance delay.

If we can break down the total delay into the component delays - air traffic and maintenance - and those components are easy to generate, we can produce an expression for the total delay that combines the individual pieces.

The goal is to generate a random variable with cdf F(x) of the following form:

F(x) = \sum_{j=1}^\infty p_jF_j(x)

In plain English, we want to generate a random variable whose cdf is a convex combination of other, "easier" cdf's, the F_j's. Note that p_j > 0 and the sum of all p_j's equals one.

Again, the main idea here is that, instead of trying to generate one complicated random variable directly, we can generate from one of a collection of simpler random variables, selecting the jth component with probability p_j.

Don't fixate on the upper limit being infinity in the summation expression. Infinity is just the theoretical upper limit. In our airplane example, the actual limit was 2.

Here's the algorithm that we will use. First, we generate a positive integer J such that P(J=j) = p_j for all j. Then, we return X from cdf F_J(x).
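The two-step algorithm can be sketched in a few lines of code. This is a minimal illustration, not from the lecture; the names `ps` and `samplers` are hypothetical, and each sampler stands in for one of the easy cdf's F_j.

```python
import random

def composition_sample(ps, samplers, u=None):
    """Composition: pick index J with P(J = j) = ps[j], then draw X from F_J."""
    u = random.random() if u is None else u  # uniform used to select J
    cum = 0.0
    for p, sampler in zip(ps, samplers):
        cum += p
        if u <= cum:
            return sampler()
    return samplers[-1]()  # guard against floating-point round-off in sum(ps)
```

In the airplane example, something like `composition_sample([0.8, 0.2], [small_air_traffic_delay, big_maintenance_delay])` would mix a frequent small delay with an occasional large one (both sampler names are hypothetical).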

Proof

Let's look at a proof that X has cdf F(x). First, we know that the cdf of X, by definition, is P(X \leq x). Additionally, by the law of total probability:

P(X \leq x) = \sum_{j=1}^\infty P(X \leq x | J=j)P(J=j)

Now, we expressed P(J=j) as p_j, and we know that P(X \leq x|J=j) is equal to F_j(x). Therefore:

\begin{alignedat}{1} P(X \leq x) &= \sum_{j=1}^\infty P(X \leq x | J=j)P(J=j) \\ &= \sum_{j=1}^\infty F_j(x)p_j = F(x) \end{alignedat}

Example

Consider the Laplace distribution, with the following pdf and cdf:

f(x) \equiv \left\{ \begin{matrix} \frac{1}{2}e^x, & x < 0 \\ \frac{1}{2}e^{-x}, & x > 0 \\ \end{matrix} \right. \quad \text{and} \quad F(x) \equiv \left\{ \begin{matrix} \frac{1}{2}e^x, & x < 0 \\ 1 - \frac{1}{2}e^{-x}, & x > 0 \\ \end{matrix} \right.

Note that the pdf and cdf look similar to those for exponential random variables, except that x is allowed to be less than zero and each piece carries a coefficient of one-half. We can think of this distribution as the exponential distribution reflected off the y-axis.

Let's decompose this X into "negative exponential" and regular exponential distributions:

F_1(x) \equiv \left\{ \begin{matrix} e^x, & x < 0 \\ 1, & x > 0 \\ \end{matrix} \right. \quad \text{and} \quad F_2(x) \equiv \left\{ \begin{matrix} 0, & x < 0 \\ 1 - e^{-x}, & x > 0 \\ \end{matrix} \right.

If we multiply each cdf here by one-half and sum them together, we get a Laplace random variable:

F(x) = \frac{1}{2}F_1(x) + \frac{1}{2}F_2(x)

Let's look at both cases. When x < 0:

F(x) = \frac{1}{2}e^x + \frac{1}{2}(0) = \frac{1}{2}e^x

When x > 0:

F(x) = \frac{1}{2}(1) + \frac{1}{2}(1-e^{-x}) = 1 - \frac{1}{2}e^{-x}

As we can see, the composed cdf matches the expected cdf, both for x > 0 and x < 0.

The cdf tells us that we generate X from F_1(x) half the time and from F_2(x) the other half of the time. Correspondingly, we'll use inverse transform to solve F_1(X) = e^X = U for X half the time and we'll solve F_2(X) = 1-e^{-X} = U the other half (since 1-U and U have the same distribution, the second solution simplifies to -\ln(U)). Consider:

X \leftarrow \left\{ \begin{matrix} \ln(U), & \text{w/ probability } 1/2 \\ -\ln(U), & \text{w/ probability } 1/2 \\ \end{matrix} \right.

Precisely, we transform our uniform into a negative exponential with probability one-half and an exponential with probability one-half.
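A quick sketch of this Laplace generator (illustrative code, not from the lecture):

```python
import math, random

def laplace_sample():
    """Composition: w.p. 1/2 return ln(U) (cdf e^x, x < 0); else -ln(U) (cdf 1 - e^{-x})."""
    u = 1.0 - random.random()  # in (0, 1], so log(u) is always defined
    if random.random() < 0.5:
        return math.log(u)   # "negative exponential" piece
    return -math.log(u)      # regular exponential piece
```

By symmetry the sample mean should be near zero, and about half the draws should be negative.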

Box-Muller Normal RVs

In this lesson, we will talk about the Box-Muller method, which is a special-case algorithm used to generate standard normal random variables.

Box-Muller Method

If U_1, U_2 are iid \mathcal U(0,1), then the following quantities are iid Nor(0,1):

\begin{alignedat}{1} Z_1 &= \sqrt{-2\ln{(U_1)}}\cos(2\pi U_2) \\ Z_2 &= \sqrt{-2\ln{(U_1)}}\sin(2\pi U_2) \end{alignedat}

Note that we must perform the trig calculations in radians! In degrees, 2\pi U_2 is a very small quantity, and the resulting Z_1 and Z_2 will not be iid Nor(0,1).
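The two equations above translate directly into code. A minimal sketch (not the lecture's implementation):

```python
import math, random

def box_muller():
    """Return a pair of iid Nor(0,1) variates from two iid U(0,1)s (trig in radians)."""
    u1 = 1.0 - random.random()  # in (0, 1], so log(u1) is always defined
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)
```

Averaging many pairs, the sample mean should sit near 0 and the sample variance near 1.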

Interesting Corollaries

We've mentioned before that if we square a standard normal random variable, we get a chi-squared random variable with one degree of freedom. If we add n independent chi-squared randoms, we get a chi-squared with n degrees of freedom:

Z^2_1 + Z^2_2 \sim \chi^2(1) + \chi^2(1) \sim \chi^2(2)

Meanwhile, let's consider the sum of Z^2_1 and Z^2_2 algebraically:

Z^2_1 + Z^2_2 = -2\ln(U_1)(\cos^2(2\pi U_2) + \sin^2(2\pi U_2))

Remember from trig class that \sin^2(x) + \cos^2(x) = 1, so:

Z^2_1 + Z^2_2 = -2\ln(U_1)

Remember also how we transform exponential random variables:

X = \frac{-1}{\lambda}\ln U \sim \text{Exp}(\lambda)

All this to say, we just demonstrated that Z^2_1 + Z^2_2 \sim \text{Exp}(1/2). Furthermore, and more interestingly, we just proved that:

\chi^2(2) \sim \text{Exp}(1/2)

Let's look at another example. Suppose we take Z_1/Z_2. One standard normal divided by another is a Cauchy random variable, which is also a t(1) random variable.

As an aside, we can see how Cauchy random variables take on such a wide range of values. Suppose the normal in the denominator is very close to zero, and the normal in the numerator is further from zero. In that case, the quotient can take on very large positive and negative values.

Moreover:

Z_2/Z_1 = \frac{\sqrt{-2\ln{(U_1)}}\sin(2\pi U_2)}{\sqrt{-2\ln{(U_1)}}\cos(2\pi U_2)} = \tan(2\pi U_2)

Thus, we've just proven that \tan(2\pi U) \sim \text{Cauchy}. Likewise, we can take Z_1/Z_2, which proves additionally that \cot(2\pi U) \sim \text{Cauchy}.

Furthermore:

Z_2^2 / Z_1^2 = \tan^2(2\pi U) \sim t^2(1) \sim F(1,1)

Polar Method

We can also use the polar method to generate standard normals, and this method is slightly faster than Box-Muller.

First, we generate two uniforms, U_1, U_2 \overset{\text{iid}}{\sim} \mathcal U(0,1). Next, let's perform the following transformation:

\begin{alignedat}{1} V_i &= 2U_i - 1, \quad i = 1,2 \\ W &= V_1^2 + V_2^2 \end{alignedat}

Finally, we use the acceptance-rejection method. If W > 1, we reject and return to sampling uniforms. Otherwise:

Y = \sqrt{-2\ln(W)/W}

We accept Z_i \leftarrow V_iY, \; i=1,2. As it turns out, Z_1 and Z_2 are iid Nor(0,1). This method is slightly faster than Box-Muller because it avoids expensive trigonometric calculations.
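The acceptance-rejection loop above can be sketched as follows (illustrative code, not from the lecture):

```python
import math, random

def polar_normals():
    """Polar method: accept (V1, V2) inside the unit circle, then scale by Y."""
    while True:
        v1 = 2.0 * random.random() - 1.0
        v2 = 2.0 * random.random() - 1.0
        w = v1 * v1 + v2 * v2
        if 0.0 < w <= 1.0:  # reject W > 1 (and the measure-zero corner W = 0)
            y = math.sqrt(-2.0 * math.log(w) / w)
            return v1 * y, v2 * y
```

Each accepted pair costs, on average, 4/π ≈ 1.27 proposal pairs, since that is the ratio of the square's area to the circle's.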

Order Statistics and Other Stuff

In this lesson, we will talk about how to generate order statistics efficiently.

Order Statistics

Suppose that we have iid observations, X_1, X_2,...,X_n, from some distribution with cdf F(x). We are interested in generating a random variable, Y, such that Y is the minimum of the X_i's: Y = \min\{X_1,...,X_n\}. Since Y is itself a random variable, let's call its cdf G(y). Furthermore, since Y refers to the smallest X_i, it's called the first order statistic.

To generate Y, we could generate n X_i's individually, which takes n units of work. Could we generate Y more efficiently, using just one \mathcal U(0,1)?

The answer is yes! Since the X_i's are iid, we have:

G(y) = 1 - P(Y > y) = 1 - P(\min_iX_i > y)

In English, the cdf of Y is P(Y \leq y), which is equivalent to the complement: 1 - P(Y > y). Since Y = \min_i X_i, G(y) = 1 - P(\min_i X_i > y).

Now, if the minimum of the X_i's is greater than y, then all of the X_i's must be greater than y:

\begin{alignedat}{1} G(y) &= 1 - P(\text{all } X_i\text{'s} > y) \\ &= 1 - [P(X_1 > y) * P(X_2 > y) * ... * P(X_n > y)] \end{alignedat}

Since the X_i's are iid, P(X_i > y) = P(X_1 > y). Therefore:

\begin{alignedat}{1} G(y) &= 1 - [P(X_1 > y) * P(X_1 > y) * ... * P(X_1 > y)] \\ &= 1 - [P(X_1 > y)]^n \end{alignedat}

Finally, note that F(y) = P(X_1 \leq y). Therefore, 1 - F(y) = P(X_1 > y), so:

G(y) = 1 - [1- F(y)]^n

At this point, we can use inverse transform to express Y in terms of U:

\begin{alignedat}{1} U &= 1 - [1- F(Y)]^n \\ [1- F(Y)]^n &= 1 - U \\ 1 - F(Y) &= (1-U)^{1/n}\\ Y &= F^{-1}(1 - (1-U)^{1/n}) \end{alignedat}
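The final line translates directly into code. A sketch, where `F_inv` is whatever inverse cdf is available (the names are illustrative):

```python
import math, random

def min_order_stat(F_inv, n, u=None):
    """Generate Y = min of n iid draws from F using a single uniform."""
    u = random.random() if u is None else u
    return F_inv(1.0 - (1.0 - u) ** (1.0 / n))
```

For instance, with the Exp(1) inverse cdf F_inv(p) = -ln(1-p), this produces Exp(n) variates, matching the example below.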

Example

Suppose X_1,...,X_n \sim \text{Exp}(\lambda). Then:

\begin{alignedat}{1} G(y) &= 1 - (1 - (1 - e^{-\lambda y}))^n \\ &= 1 - e^{-n\lambda y} \end{alignedat}

So, Y = \min_iX_i \sim \text{Exp}(n\lambda). If we apply inverse transform (using the fact that 1-U and U have the same distribution), we get:

Y = \frac{-1}{n\lambda}\ln(U)

We can do the same kind of thing for Z = \max_i X_i. Let's try it! If Z = \max_iX_i, then we can express its cdf, H(z), as:

P(Z \leq z) = P(\max_iX_i \leq z)

Now, if the maximum of the X_i's is less than or equal to z, then all of the X_i's must be less than or equal to z:

\begin{alignedat}{1} H(z) &= P(\text{all } X_i\text{'s} \leq z) \\ &= [P(X_1 \leq z) * P(X_2 \leq z) * ... * P(X_n \leq z)] \end{alignedat}

Since the X_i's are iid, P(X_i \leq z) = P(X_1 \leq z). Therefore:

\begin{alignedat}{1} H(z) &= [P(X_1 \leq z) * P(X_1 \leq z) * ... * P(X_1 \leq z)] \\ &= [P(X_1 \leq z)]^n \end{alignedat}

Finally, note that F(z) = P(X_1 \leq z). Therefore:

H(z) = [F(z)]^n

Suppose X_1,...,X_n \sim \text{Exp}(\lambda). Then:

H(z) = (1 - e^{-\lambda z})^n

Let's apply inverse transform:

\begin{alignedat}{1} U &= (1 - e^{-\lambda Z})^n \\ U^{1/n} &= 1 - e^{-\lambda Z} \\ e^{-\lambda Z} &= 1 - U^{1/n} \\ -\lambda Z &= \ln(1 - U^{1/n}) \\ Z &= \frac{-\ln(1 - U^{1/n})}{\lambda} \end{alignedat}
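The last line gives a one-uniform generator for the maximum. A small sketch (illustrative, with λ passed in as `lam`):

```python
import math, random

def max_exponential(n, lam, u=None):
    """Generate Z = max of n iid Exp(lam) variates from a single uniform."""
    u = random.random() if u is None else u
    return -math.log(1.0 - u ** (1.0 / n)) / lam
```

As a sanity check, the mean of the maximum of n iid Exp(1)'s is the harmonic sum 1 + 1/2 + ... + 1/n.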

Other Stuff

If Z_1, Z_2,...,Z_n are iid Nor(0,1), then the sum of the squares of the Z_i's is a chi-squared random variable with n degrees of freedom:

\sum_{i=1}^n Z^2_i \sim \chi^2(n)

If Z \sim \text{Nor}(0,1), and Y \sim \chi^2(n), and Z and Y are independent, then:

\frac{Z}{\sqrt{Y/n}} \sim t(n)

Note that t(1) is the Cauchy distribution.

If X \sim \chi^2(n), and Y \sim \chi^2(m), and X and Y are independent, then:

\frac{(X/n)}{(Y/m)} \sim F(n,m)

If we want to generate random variables from continuous empirical distributions, we'll have to settle for the CONT function in Arena for now.

Multivariate Normal Distribution

In this lesson, we will look at the multivariate normal distribution. This distribution is very important, and people use it all the time to model observations that are correlated with one another.

Bivariate Normal Distribution

Consider a random vector (X, Y). For example, we might draw X from a distribution of heights and Y from a distribution of weights. This vector has the bivariate normal distribution with means \mu_X = E[X] and \mu_Y = E[Y], variances \sigma^2_X = \text{Var}(X) and \sigma^2_Y = \text{Var}(Y), and correlation \rho = \text{Corr}(X,Y) if it has the following joint pdf:

f(x,y) = \frac{1}{2\pi \sigma_X \sigma_Y\sqrt{1-\rho^2}}\exp\left\{\frac{-\left[z^2_X(x) + z^2_Y(y) - 2 \rho z_X(x)z_Y(y)\right]}{2(1-\rho^2)}\right\}

Note the following definitions for z_X(x) and z_Y(y), which are standardized versions of the corresponding variables:

z_X(x) \equiv \frac{x - \mu_X}{\sigma_X}, \quad z_Y(y) \equiv \frac{y - \mu_Y}{\sigma_Y}

Let's contrast this pdf with the univariate normal pdf:

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}

The fraction in front of the exponential looks similar in both cases. In fact, we could argue that the fraction for the univariate case might look like this:

\frac{1}{\sigma\sqrt{2\pi}\sqrt{1-\rho^2}}

In the univariate case, however, there is no second variable to correlate with, so we can take \rho = 0, and the expression involving \rho^2 reduces to one. All this to say, the bivariate case expands on the univariate case by incorporating another \sigma\sqrt{2\pi} term. Consider:

\begin{alignedat}{1} f(x,y) &= \frac{1}{\sigma_X \sqrt{2\pi} \sigma_Y \sqrt{2\pi}\sqrt{1-\rho^2}}\exp\{\cdots\} \\ &= \frac{1}{2\pi\sigma_X \sigma_Y \sqrt{1-\rho^2}}\exp\{\cdots\} \end{alignedat}

Now let's look at the expression inside the exponential. Consider again the exponential for the univariate case:

\exp\left\{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}

Again, we might argue that this expression contains a hidden term involving \rho that disappears when we take \rho = 0:

\exp\left\{\frac{-1}{2(1-\rho^2)}\left(\frac{x-\mu}{\sigma}\right)^2\right\}

If we look at the bivariate case again, we see something that looks similar. We retain the -1/2 coefficient and we square standardized versions of X and Y, although the standardization part is hidden away in z_X and z_Y:

\exp\left\{\frac{-\left[z^2_X(x) + z^2_Y(y) - 2 \rho z_X(x)z_Y(y)\right]}{2(1-\rho^2)}\right\}

As we said, we can model heights and weights of people as a bivariate normal random variable because those two observations are correlated with one another. In layman's terms: shorter people tend to be lighter, and taller people tend to be heavier.

For example, consider the following observations, taken from a bivariate normal distribution where both means are zero, both variances are one, and the correlation between the two variables is 0.9.

Multivariate Normal

The random vector \bold X = (X_1,...,X_k)^\top has the multivariate normal distribution with mean vector \boldsymbol \mu = (\mu_1,...,\mu_k)^\top and k \times k covariance matrix \Sigma = (\sigma_{ij}) if it has the following pdf:

f(\bold x) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left\{-\frac{(\bold x - \boldsymbol \mu)^\top\Sigma^{-1}(\bold x - \boldsymbol \mu)}{2}\right\}, \quad \bold x \in \mathbb{R}^k

Notice that the vector \bold X and the vector \boldsymbol \mu have the same dimensions: random variable X_i has expected value \mu_i.

Now let's talk about the covariance matrix \Sigma. This matrix has k rows and k columns, and the cell at the ith row and jth column holds the covariance between X_i and X_j. Of course, the cell at \Sigma_{ii} holds the covariance of X_i with itself, which is just the variance of X_i.

Now let's look at the fraction in front of the exponential expression. Remember that, in one dimension, we took \sqrt{2\pi}. In k dimensions, we take (2\pi)^{k/2}. In one dimension, we take \sqrt{\sigma^2}. In k dimensions, we take the square root of the determinant of \Sigma, where the determinant serves as a generalization of the variance.

Now let's look at the exponential expression. In one dimension, we computed:

\exp\left\{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right\}

In k dimensions, we compute:

\exp\left\{-\frac{(\bold x - \boldsymbol \mu)^\top\Sigma^{-1}(\bold x - \boldsymbol \mu)}{2}\right\}

Here, (\bold x - \boldsymbol \mu)^\top(\bold x - \boldsymbol \mu) corresponds to (x - \mu)^2, and \Sigma^{-1} is the inverse of the covariance matrix, which corresponds to 1/\sigma^2 in the univariate case.

All this to say, the multivariate normal distribution generalizes the univariate normal distribution.

The multivariate case has the following properties:

E[X_i] = \mu_i, \quad \text{Var}(X_i) = \sigma_{ii}, \quad \text{Cov}(X_i, X_j) = \sigma_{ij}

We can express a random variable coming from this distribution with the following notation: \bold X \sim \text{Nor}_k(\boldsymbol{\mu}, \Sigma).

Generating Multivariate Random Variables

To generate \bold X, let's start out with a vector of iid Nor(0,1) random variables of the same dimension as \bold X: \bold Z = (Z_1,...,Z_k)^\top. We can express \bold Z as \bold Z \sim \text{Nor}_k(\bold 0, I), where \bold 0 is a k-dimensional vector of zeroes and I is the k \times k identity matrix. Note that the identity matrix makes sense here: since the Z_i's are iid with variance one, the diagonal entries are one and everything off the diagonal is zero.

Now, suppose that we can find the lower triangular k \times k Cholesky matrix C, such that \Sigma = CC^\top. We can think of this matrix, informally, as the "square root" of the covariance matrix.

It can be shown that \bold X = \boldsymbol \mu + C \bold Z is multivariate normal with mean \boldsymbol \mu and covariance matrix:

\Sigma \equiv E[(C\bold Z)(C \bold Z)^\top] = E[C \bold Z \bold Z^\top C^\top] = C(E[\bold Z \bold Z^\top])C^\top = CC^\top

How do we find this magic matrix C? For k = 2, we can easily derive C:

C = \left( \begin{matrix} \sqrt{\sigma_{11}} & 0 \\ \frac{\sigma_{12}}{\sqrt{\sigma_{11}}} & \sqrt{\sigma_{22} - \frac{\sigma^2_{12}}{\sigma_{11}}} \end{matrix} \right)

Again, if we multiply C by C^\top, we get the covariance matrix, \Sigma:

\begin{alignedat}{1} \Sigma &= CC^\top \\ &= \left(\begin{matrix} \sqrt{\sigma_{11}} & 0 \\ \frac{\sigma_{12}}{\sqrt{\sigma_{11}}} & \sqrt{\sigma_{22} - \frac{\sigma^2_{12}}{\sigma_{11}}} \end{matrix} \right) \times \left(\begin{matrix} \sqrt{\sigma_{11}} & \frac{\sigma_{12}}{\sqrt{\sigma_{11}}} \\ 0 & \sqrt{\sigma_{22} - \frac{\sigma^2_{12}}{\sigma_{11}}} \end{matrix} \right) \\[4ex] &= \left(\begin{matrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{matrix} \right) \end{alignedat}

Here's how we generate \bold X. Since \bold X = \boldsymbol \mu + C \bold Z, if we carry the vector addition and matrix multiplication out, we have:

\begin{alignedat}{1} X_1 &= \mu_1 + \sqrt{\sigma_{11}}Z_1 + 0Z_2 \\ X_2 &= \mu_2 + \frac{\sigma_{12}}{\sqrt{\sigma_{11}}}Z_1 + \sqrt{\sigma_{22} - \frac{\sigma^2_{12}}{\sigma_{11}}}Z_2 \end{alignedat}

Here's the algorithm for computing a k-dimensional Cholesky matrix, C.

Once we compute C, we can easily generate the multivariate normal random variable \bold X = \boldsymbol \mu + C \bold Z. First, we generate k iid Nor(0,1) random variables, Z_1, Z_2,...,Z_k. Next, we set X_i equal to the following expression:

X_i \leftarrow \mu_i + \sum_{j=1}^i c_{ij}Z_j, \quad i = 1,2,...,k

Note that the sum we take above is nothing more than the sum of the standard normals multiplied by the entries of the ith row of C.

Finally, we return \bold X = (X_1, X_2,..., X_k)^\top.
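The steps above can be sketched in code: the standard Cholesky recursion followed by the generation step X ← μ + CZ. This is an illustrative sketch, not the lecture's implementation; `random.gauss` stands in for the Nor(0,1) generator, and no pivoting or validity checks are included.

```python
import math, random

def cholesky(sigma):
    """Lower-triangular C with sigma = C C^T (standard Cholesky recursion)."""
    k = len(sigma)
    c = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(c[i][m] * c[j][m] for m in range(j))
            if i == j:
                c[i][i] = math.sqrt(sigma[i][i] - s)
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c

def multivariate_normal(mu, sigma):
    """X_i = mu_i + sum_{j <= i} c_ij * Z_j for iid Nor(0,1) Z_j."""
    c = cholesky(sigma)
    z = [random.gauss(0.0, 1.0) for _ in mu]
    return [mu_i + sum(c[i][j] * z[j] for j in range(i + 1))
            for i, mu_i in enumerate(mu)]
```

For k = 2, `cholesky` reproduces the closed-form matrix derived above.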

Baby Stochastic Processes (OPTIONAL)

In this lesson, we will talk about how we generate some easy stochastic processes: Markov chains and Poisson arrivals.

Markov Chains

A time series is a collection of observations ordered by time: day to day or minute to minute, for instance. We will look at a time series of daily weather observations and determine the probability of moving between two different states, sunny and rainy.

It could be the case that the days are not independent of one another; in other words, if we have thunderstorms today, there is likely a higher than average probability that we will have thunderstorms tomorrow.

Suppose it turns out the probability that we have a thunderstorm tomorrow only depends on whether we had a thunderstorm today, and nothing else. In that case, the sequence of weather outcomes is a type of stochastic process called a Markov chain.

Example

Let's let X_i = 0 if it rains on day i; otherwise, X_i = 1. Note that we aren't dealing with Bernoulli observations here because the trials are not independent; that is, the weather on day i+1 depends on the weather on day i.

We can denote the day-to-day transition probabilities with the following equation:

P_{jk} = P(\text{state } k \text{ on day } i + 1 | \text{ state } j \text{ on day } i), \quad j,k = 0,1

In other words, P_{jk} refers to the probability of experiencing state X = k on day i+1, given that we experienced state X = j on day i. For example, P_{01} refers to the probability that it is sunny tomorrow, given that it rained today.

Suppose that we know the various transition probabilities, and we have the following state transition matrix:

P = \begin{pmatrix} P_{00} & P_{01} \\ P_{10} & P_{11} \end{pmatrix} = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}

Remember that state zero is rain and state one is sun. Therefore, P_{00} = 0.7 means that the probability that we have rain on day i+1 given that we had rain on day i equals 0.7. Likewise, P_{01} = 0.3 means that the probability that we have sun on day i+1 given that we had rain on day i equals 0.3.

Notice that the probabilities add up to one across the rows, signifying that, for a given state on day i, the probability of moving to some state on day i+1 is one: the process must transition to some state (of course, this state could be the same as the current state).

Suppose it rains on Monday, and we want to simulate the rest of the week. We will fill out the following table:

\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \end{array}

Let's talk about the columns. Obviously, the first column refers to the day of the week. The column titled P(\text{R} | X_{i-1}) refers to the probability of rain given yesterday's weather. Note that we also represent this probability with P_{\cdot 0}, where the dot refers to yesterday's state.

In the third column, we sample a uniform. In the fourth column, we check whether that uniform is less than the transition probability from yesterday's state to rain. Finally, in the fifth column, we record whether it rains, based on whether the uniform is less than the transition probability.

Let's see what happens on Tuesday:

\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \\ \text{Tu} & P_{00} = 0.7 & 0.62 & \text{Y} & \text{R} \end{array}

Here, we are looking at the probability of transitioning from a rainy day to another rainy day. If we look up that particular probability in our state transition matrix, we get P_{00} = 0.7. Next, we sample a uniform, U_i = 0.62, and check whether that uniform is less than the transition probability. Since 0.62 < 0.7, we say that it will rain on Tuesday.

Let's see what happens on Wednesday:

\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \\ \text{Tu} & P_{00} = 0.7 & 0.62 & \text{Y} & \text{R} \\ \text{W} & P_{00} = 0.7 & 0.03 & \text{Y} & \text{R} \end{array}

Similarly, we are looking at the probability of again transitioning from a rainy day to another rainy day, so our transition probability is still P_{00} = 0.7. We draw another uniform, U_i = 0.03, and, since 0.03 < 0.7, we say that it will rain on Wednesday.

Let's see what happens on Thursday:

\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \\ \text{Tu} & P_{00} = 0.7 & 0.62 & \text{Y} & \text{R} \\ \text{W} & P_{00} = 0.7 & 0.03 & \text{Y} & \text{R} \\ \text{Th} & P_{00} = 0.7 & 0.77 & \text{N} & \text{S} \end{array}

Here, we have the same transition probability as previously, but our uniform, U_i = 0.77, happens to be larger than our transition probability: 0.77 > 0.7. As a result, we say that it will not rain on Thursday.

Finally, let's look at Friday:

\begin{array}{ccccc} \text{day} & P(R|X_{i-1}) & U_i & U_i < P_{\cdot 0}? & \text{R/S} \\ \hline \text{M} & - & - & - & \text{R} \\ \text{Tu} & P_{00} = 0.7 & 0.62 & \text{Y} & \text{R} \\ \text{W} & P_{00} = 0.7 & 0.03 & \text{Y} & \text{R} \\ \text{Th} & P_{00} = 0.7 & 0.77 & \text{N} & \text{S} \\ \text{F} & P_{10} = 0.4 & 0.91 & \text{N} & \text{S} \\ \end{array}

On Friday, we have a new transition probability. This time, we are looking at the probability of rain given sun, and P_{10} = 0.4. Again we draw our uniform, which happens to be greater than the transition probability, so we say it will be sunny on Friday.
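The table-filling procedure can be sketched as code. Here the uniforms can be passed in explicitly, so the example week is reproducible (illustrative code, not from the lecture):

```python
import random

# P[(today, tomorrow)]: state transition probabilities, R = rain, S = sun
P = {('R', 'R'): 0.7, ('R', 'S'): 0.3, ('S', 'R'): 0.4, ('S', 'S'): 0.6}

def simulate_weather(start='R', days=4, uniforms=None):
    """Each day: rain tomorrow iff U < P(rain | today's state)."""
    state, path = start, [start]
    for i in range(days):
        u = random.random() if uniforms is None else uniforms[i]
        state = 'R' if u < P[(state, 'R')] else 'S'
        path.append(state)
    return path
```

Calling `simulate_weather(uniforms=[0.62, 0.03, 0.77, 0.91])` reproduces the Monday-to-Friday walk in the table: R, R, R, S, S.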

Poisson Arrivals

Let's suppose we have a Poisson(\lambda) process with a constant arrival rate of \lambda. The interarrival times of such a process are iid Exp(\lambda). Remember that, even though the rate is constant, the arrivals themselves are still random because the interarrival times are random.

Let's look at how we generate the arrival times. Note that we set T_0 \leftarrow 0 to initialize the process at time zero. From there, we can compute the ith arrival time, T_i, as:

T_i \leftarrow T_{i-1} - \frac{1}{\lambda}\ln U_i, \quad i \geq 1

In other words, the next arrival time equals the previous arrival time, plus an exponential random variable referring to the interarrival time. Even more basically, the time at which the next person shows up is equal to the time at which the last person showed up, plus some randomly generated interarrival time.

Note that, even though we seem to be subtracting from T_{i-1}, the natural log of 0 < U_i < 1 is a negative number, so we are, in fact, adding two positive quantities here.

We refer to this iterative process as bootstrapping, whereby we build subsequent arrivals on previous arrivals.
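The bootstrapping recursion can be sketched in a few lines (illustrative code, not from the lecture):

```python
import math, random

def poisson_arrival_times(lam, n):
    """Bootstrap arrivals: T_i = T_{i-1} - ln(U_i) / lam, starting from T_0 = 0."""
    t, times = 0.0, []
    for _ in range(n):
        t -= math.log(1.0 - random.random()) / lam  # 1 - U avoids log(0)
        times.append(t)
    return times
```

Because each interarrival time is positive, the generated arrival times are strictly increasing, and T_n / n should approach the mean interarrival time 1/λ.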

Fixed Arrivals

Suppose we want to generate a fixed number, n, of arrivals from a Poisson(\lambda) process in a fixed time interval [a,b]. Of course, we cannot guarantee any particular number of arrivals in a fixed time interval of a Poisson process because the interarrival times are random.

It turns out that there is a theorem that states the following: the joint distribution of n arrivals from the Poisson process during some interval [a,b] is equivalent to the joint distribution of the order statistics of n iid \mathcal U(a,b) random variables.

Here's the algorithm. First, we generate n iid \mathcal U(0,1) uniforms, U_1,...,U_n. Next, we sort them: U_{(1)} < U_{(2)} < ... < U_{(n)}. Finally, we transform them to lie on the interval [a,b]:

T_i \leftarrow a + (b-a)U_{(i)}
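This three-step algorithm is a one-liner in code (illustrative sketch):

```python
import random

def conditional_poisson_arrivals(n, a, b):
    """n Poisson-process arrivals in [a, b]: sorted uniforms rescaled onto the interval."""
    return [a + (b - a) * u for u in sorted(random.random() for _ in range(n))]
```

Note that the arrival rate λ never appears: conditioned on there being n arrivals, their locations do not depend on λ.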

Nonhomogeneous Poisson Processes

In this lesson, we will look at nonhomogeneous Poisson processes (NHPPs), where the arrival rate changes over time. We have to be careful here: we can't use the algorithm we discussed last time to generate arrivals from NHPPs.

NHPPs - Nonstationary Arrivals

An NHPP maintains the same assumptions as a standard Poisson process, except that the arrival rate, \lambda, isn't constant, so the stationary increments assumption no longer applies.

Let's define the function \lambda(t), which describes the arrival rate at time t. Let's also define the function N(t), which counts the number of arrivals during [0,t].

Consider the following theorem, which describes the number of arrivals from an NHPP between time s and time s+t:

N(s+t) - N(s) \sim \text{Poisson}\left(\int_s^{s+t} \lambda(u)du\right)

Example

Suppose that the arrival pattern to the Waffle House over a certain time period is an NHPP with \lambda(t) = t^2. Let's find the probability that there will be exactly four arrivals between times t=1 and t=2.

From the equation above, we can see that the number of arrivals in this time interval is:

N(2) - N(1) \sim \text{Pois}\left(\int_1^{2} t^2dt\right) \sim \text{Pois}(7/3)

Since the number of arrivals is X \sim \text{Pois}(7/3), we can calculate P(X = 4) using the pmf, f(x):

f(4) = \frac{e^{-\lambda}\lambda^x}{x!}\bigg|_{x = 4} = \frac{e^{-7/3}(7/3)^4}{4!} \approx 0.120
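A quick check of this arithmetic (illustrative code):

```python
import math

def poisson_pmf(x, mean):
    """P(N = x) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean ** x / math.factorial(x)

# Mean = integral of t^2 from 1 to 2 = 7/3; P(X = 4) should be about 0.120
p4 = poisson_pmf(4, 7.0 / 3.0)
```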

Incorrect NHPP Algorithm

Now, let's look at an incorrect algorithm for generating NHPP arrivals. This algorithm is particularly bad because it can "skip" entire intervals if \lambda(t) grows very quickly.

Here's the algorithm.

We initialize the algorithm with T_0 \leftarrow 0; i \leftarrow 0. To generate the ith arrival, we first sample U \sim \mathcal U(0,1), and then we perform the following computation:

T_{i+1} \leftarrow T_i - \frac{1}{\lambda(T_i)}\ln U

Note that, instead of multiplying by -1/\lambda, we multiply by -1/\lambda(T_i), which uses the arrival rate at the time of the previous arrival. The problem arises when the arrival rate radically changes between the time we generate the new arrival and the time that arrival occurs.

Specifically, if \lambda(T_{i+1}) \gg \lambda(T_i), then the arrival rate we use to generate arrival T_{i+1}, evaluated at time T_i, is too small. Since we capture the arrival rate of the next arrival at the time of the current arrival, we essentially treat the rate as fixed between that time and the time the arrival actually occurs.

Thinning Algorithm

Consider the following rate function, \lambda(t), graphed below. We can suppose, perhaps, that this function depicts the arrivals at a restaurant, and the peaks correspond to breakfast, lunch, and dinner rushes.

We are going to use the thinning algorithm to generate the ith arrival. First, we assume that \lambda^* = \max_t\lambda(t) is finite. We can see the line y = \lambda^* drawn below.

From there, we will generate potential arrivals at that maximum rate \lambda^*:

T_{i+1} = T_i - \frac{1}{\lambda^*}\ln U

We can see the potential arrivals in dotted purple below.

We will accept a potential arrival at time t as a real arrival with probability \lambda(t) / \lambda^*. Thus, as \lambda(t) approaches \lambda^*, the likelihood that we accept t as a real arrival increases proportionally. We call this algorithm the thinning algorithm because it thins out potential arrivals.

Let's look at the algorithm.

We initialize the algorithm T00;i0T_0 \leftarrow 0; i \leftarrow 0. Next, we initialize tTit \leftarrow T_i, and then we iterate. We generate two iid Unif(0,1) random variables, UU and VV, and then we generate a potential arrival:

t←t−1λ∗lnUt \leftarrow t - \frac{1}{\lambda^*}\ln U

We keep generating UU and VV until this condition holds:

Vλ(t)/λV \leq \lambda(t)/\lambda^*

In other words, we only keep the potential arrival with probability λ(t)/λ\lambda(t)/\lambda^*. After we accept tt, we set ii to i+1i+1, and TiT_i to tt, and repeat.
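The thinning algorithm above is straightforward to code. Here is a minimal Python sketch (the function name is ours; the course demos use Excel and Matlab), using a sinusoidal rate function bounded above by λ∗=2\lambda^* = 2:

```python
import math
import random

def thinning(lam, lam_star, horizon, rng=random):
    """Generate NHPP arrival times on [0, horizon] by thinning.
    lam(t) is the rate function; lam_star is a finite bound with
    lam(t) <= lam_star for all t."""
    arrivals = []
    t = 0.0
    while True:
        # candidate arrival generated at the maximum rate lam_star
        t -= math.log(rng.random()) / lam_star
        if t > horizon:
            return arrivals
        # accept the candidate as a real arrival w.p. lam(t) / lam_star
        if rng.random() <= lam(t) / lam_star:
            arrivals.append(t)

# example: a sinusoidal rate, bounded above by lam_star = 2
arrivals = thinning(lambda t: 1 + math.sin(t / 5), 2.0, 100.0)
```

Candidates generated in the "valleys" of λ(t)\lambda(t) are mostly rejected, which is exactly the thinning behavior described above.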

DEMO

In this demo, we will look at how to implement an NHPP in Excel.

In column A, we have potential NHPP arrival times.

In column B, we have the rate function λ(t)=1+sin(t/5)\lambda(t) = 1 + \sin(t/5). Because the sine function is periodic, so are the arrival rates generated by λ(t)\lambda(t). In fact, λ(t)\lambda(t) never goes below zero, and never goes above two, so λ=2\lambda^* = 2.

In column D, we have a sequence of UiU(0,1)U_i \sim \mathcal U(0,1) random variables, which we will use to generate interarrival times.

In column E, we generate the potential interarrival times by transforming UiU_i into an Exp(λ\lambda^*) random variable.

In column F, we bootstrap the arrivals, generating the iith arrival by adding the iith interarrival time to the (i−1)(i-1)st arrival. Note that these potential arrivals are the same values as in column A.

In column G, we generate another sequence of random variables ViU(0,1)V_i \sim \mathcal U(0,1), which we will use to determine whether we accept the potential arrival as a real arrival.

In column H, we generate λ(t)/λ\lambda(t)/\lambda^*, which refers to the probability that we accept the corresponding potential arrival at time tt.

If Vi<λ(ti)/λV_i < \lambda(t_i)/\lambda^*, then we accept the arrival, and we show this boolean value in column I.

In column J, we copy over the real arrival times for all potential arrivals where Vi<λ(ti)/λV_i < \lambda(t_i)/\lambda^*. Otherwise, we mark the cell as "FALSE".

Finally, in column K, we update the arrival rate for the next iteration.

Here's what the plot of arrival rates over time looks like for this NHPP. Every dot represents an accepted arrival. In the space between the dots, we may have generated potential arrivals, but we didn't accept them as real.

We can see that we have many real arrivals when λ(t)\lambda(t) is close to λ∗\lambda^* and few real arrivals when the two values diverge. Of course, this observation makes sense, given that we accept potential arrivals with probability λ(t)/λ∗\lambda(t) / \lambda^*.

Time Series (OPTIONAL)

In this lesson, we will look at various time series processes, most of which are used in forecasting. In particular, we will be looking at auto-regressive moving average processes (ARMA), which have standard-normal noise, and auto-regressive Pareto (ARP) processes, which have Pareto noise.

First-Order Moving Average Process

A first-order moving average process, or MA(1), is a popular time series for modeling and detecting trends. This process is defined as follows:

Yi=ϵi+θϵi1,i=1,2,...,Y_i = \epsilon_i + \theta\epsilon_{i-1}, \quad i = 1,2,...,

Here, θ\theta is a constant, and the ϵi\epsilon_i terms are typically iid Nor(0,1) - in practice, they can be iid anything - and they are independent of Y0Y_0.

Notice that Y1=ϵ1+θϵ0Y_1 = \epsilon_1 + \theta\epsilon_0, and Y2=ϵ2+θϵ1Y_2 = \epsilon_2 + \theta\epsilon_1. Both Y1Y_1 and Y2Y_2 contain an ϵ1\epsilon_1 term; generally, YiY_i and Yi1Y_{i-1} pairs are correlated.

Let's derive the variance of YiY_i. Remember that:

Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)

Let's consider Yi=ϵi+θϵi1Y_i = \epsilon_i + \theta\epsilon_{i-1}:

Var(ϵi+θϵi1)=Var(ϵi)+Var(θϵi1)+2Cov(ϵi,θϵi1)\text{Var}(\epsilon_i + \theta\epsilon_{i-1}) = \text{Var}(\epsilon_i) + \text{Var}(\theta\epsilon_{i-1}) + 2\text{Cov}(\epsilon_i,\theta\epsilon_{i-1})

Now, we know that successive ϵi\epsilon_i's are independent, so the covariance is zero:

Var(ϵi+θϵi1)=Var(ϵi)+Var(θϵi1)\text{Var}(\epsilon_i + \theta\epsilon_{i-1}) = \text{Var}(\epsilon_i) + \text{Var}(\theta\epsilon_{i-1})

Furthermore, we know that Var(aX)=a2Var(X)\text{Var}(aX) = a^2\text{Var}(X), so:

Var(ϵi+θϵi1)=Var(ϵi)+θ2Var(ϵi1)\text{Var}(\epsilon_i + \theta\epsilon_{i-1}) = \text{Var}(\epsilon_i) + \theta^2\text{Var}(\epsilon_{i-1})

Finally, since the ϵi\epsilon_i's are Nor(0,1), they both have variance one, leaving us with:

Var(ϵi+θϵi1)=1+θ2\text{Var}(\epsilon_i + \theta\epsilon_{i-1}) = 1 + \theta^2

Now, let's look at the covariance between successive Yi,Yi+1Y_i, Y_{i+1} pairs. Here's a covariance property:

Cov(aX+bY,cW+dV)=acCov(X,W)+adCov(X,V)+bcCov(Y,W)+bdCov(Y,V)\begin{alignedat}{1} & \text{Cov}(aX + bY, cW + dV) = \\ & ac\text{Cov}(X,W) + ad\text{Cov}(X,V) + bc\text{Cov}(Y,W) + bd\text{Cov}(Y,V) \end{alignedat}

So:

Cov(ϵi+θϵi1,ϵi+1+θϵi)=Cov(ϵi,ϵi+1)+θCov(ϵi,ϵi)+θCov(ϵi1,ϵi+1)+θ2Cov(ϵi1,ϵi)\begin{alignedat}{1} & \text{Cov}(\epsilon_i + \theta\epsilon_{i-1}, \epsilon_{i+1} + \theta\epsilon_i) = \\ & \text{Cov}(\epsilon_i, \epsilon_{i+1}) + \theta\text{Cov}(\epsilon_i,\epsilon_i) + \theta\text{Cov}(\epsilon_{i-1},\epsilon_{i+1}) + \theta^2\text{Cov}(\epsilon_{i-1},\epsilon_i) \end{alignedat}

Since the ϵi\epsilon_i's are independent, we know that Cov(ϵi,ϵj)=0,ij\text{Cov}(\epsilon_i, \epsilon_j) = 0, i \neq j. So:

Cov(ϵi+θϵi1,ϵi+1+θϵi)=θCov(ϵi,ϵi)\text{Cov}(\epsilon_i + \theta\epsilon_{i-1}, \epsilon_{i+1} + \theta\epsilon_i) = \theta\text{Cov}(\epsilon_i,\epsilon_i)

We also know that Cov(X,X)=Var(X)\text{Cov}(X,X) = \text{Var}(X), and, since ϵiNor(0,1)\epsilon_i \sim \text{Nor}(0,1), Var(ϵi)=1\text{Var}(\epsilon_i) = 1. So:

Cov(ϵi+θϵi1,ϵi+1+θϵi)=θ\text{Cov}(\epsilon_i + \theta\epsilon_{i-1}, \epsilon_{i+1} + \theta\epsilon_i) = \theta

We can also demonstrate, using the above formula, that Cov(Yi,Yi+k)=0,k2\text{Cov}(Y_i, Y_{i+k}) = 0, k \geq 2.

As we can see, the covariances die off quickly. We probably wouldn't use an MA(1) process to model, say, month-to-month unemployment, because unemployment rates remain correlated across several months, and an MA(1) cannot express correlation beyond lag one.

How might we generate an MA(1) process? Let's start with ϵ0Nor(0,1)\epsilon_0 \sim \text{Nor}(0,1). Then, we generate ϵ1Nor(0,1)\epsilon_1 \sim \text{Nor}(0,1) to get Y1Y_1, ϵ2Nor(0,1)\epsilon_2 \sim \text{Nor}(0,1) to get Y2Y_2, and so on. Every time we generate a new ϵi\epsilon_i, we can compute the corresponding YiY_i.
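That generation scheme can be sketched in a few lines of Python (the helper name is ours; the lecture does not prescribe an implementation):

```python
import random

def ma1(theta, n, rng=random):
    """Generate Y_1, ..., Y_n from an MA(1):
    Y_i = eps_i + theta * eps_{i-1}, with iid Nor(0,1) noise."""
    eps_prev = rng.gauss(0.0, 1.0)  # eps_0
    ys = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        ys.append(eps + theta * eps_prev)
        eps_prev = eps              # reuse eps_i for Y_{i+1}
    return ys
```

For large nn, the sample variance of the output should be close to 1+θ21 + \theta^2, matching the derivation above.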

First-Order Autoregressive Process

Now let's look at a first-order autoregressive process, or AR(1), which is used to model many different real-world processes. The AR(1) process is defined as:

Yi=ϕYi1+ϵi,i=1,2,...,Y_i = \phi Y_{i-1} + \epsilon_i, i =1,2,...,

In order for this process to remain stationary, we need to ensure that 1<ϕ<1-1 < \phi < 1, Y0Nor(0,1)Y_0 \sim \text{Nor}(0,1), and that the ϵi\epsilon_i's are iid Nor(0,1ϕ2)\text{Nor}(0, 1 - \phi^2).

Unlike the MA(1) process, non-consecutive observations in the AR(1) process have non-zero correlation. Specifically, the covariance function between YiY_i and Yi+kY_{i+k} is as follows:

Cov(Yi,Yi+k)=ϕ∣k∣,k=0,±1,±2,…\text{Cov}(Y_i, Y_{i+k}) = \phi^{|k|}, \quad k = 0, \pm 1, \pm 2, \ldots

The correlations start out large for consecutive observations and decrease as kk increases, but they never quite become zero; there is always some correlation between any two observations in the series.

If ϕ\phi is close to one, observations in the series are highly positively correlated. Alternatively, if ϕ\phi is close to negative one, observations in the series are highly negatively correlated. If ϕ\phi is close to zero, then the YiY_i's are nearly independent.

How do we generate observations from an AR(1) process? Let's start with Y0Nor(0,1)Y_0 \sim \text{Nor}(0,1) and ϵ11ϕ2Nor(0,1)\epsilon_1 \sim \sqrt{1-\phi^2}\text{Nor}(0,1) to get Y1=ϕY0+ϵ1Y_1 = \phi Y_0 + \epsilon_1. Then, we can generate ϵ21ϕ2Nor(0,1)\epsilon_2 \sim \sqrt{1-\phi^2}\text{Nor}(0,1) to get Y2=ϕY1+ϵ2Y_2 = \phi Y_1 + \epsilon_2, and so on.

Remember that:

aX∼Nor(aμ,a2σ2),X∼Nor(μ,σ2)aX \sim \text{Nor}(a\mu , a^2\sigma^2), \quad X \sim \text{Nor}(\mu,\sigma^2)

So:

1ϕ2ϵiNor(0,1ϕ2),ϵiNor(0,1)\sqrt{1-\phi^2}\epsilon_i \sim \text{Nor}(0, 1-\phi^2), \quad \epsilon_i \sim \text{Nor}(0,1)
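The recursion is just as short in code. Here is a Python sketch (function name ours) that follows the generation scheme above:

```python
import math
import random

def ar1(phi, n, rng=random):
    """Generate Y_1, ..., Y_n from a stationary AR(1):
    Y_i = phi * Y_{i-1} + eps_i, with Y_0 ~ Nor(0,1) and
    eps_i ~ Nor(0, 1 - phi^2)."""
    y = rng.gauss(0.0, 1.0)            # Y_0
    sd = math.sqrt(1.0 - phi * phi)    # noise standard deviation
    ys = []
    for _ in range(n):
        y = phi * y + rng.gauss(0.0, sd)
        ys.append(y)
    return ys
```

With this noise variance, each YiY_i is marginally Nor(0,1), and the lag-1 sample correlation should be close to ϕ\phi.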

Here are the plots of three AR(1) processes, each parameterized with a different value of ϕ\phi.

ARMA(p,q) Process

The ARMA(pp,qq) process generalizes the MA(1) and AR(1) processes, combining a ppth-order AR component with a qqth-order MA component. We define it as:

Yi=∑j=1pϕjYi−j+ϵi+∑j=1qθjϵi−j,i=1,2,…Y_i = \sum_{j=1}^p \phi_j Y_{i-j} + \epsilon_i + \sum_{j=1}^q \theta_j\epsilon_{i-j}, \quad i = 1,2,\ldots

In the first sum, we can see the pp autoregressive components, and, in the second sum, we can see the qq moving average components.

We have to take care to choose the ϕj\phi_j's and the θj\theta_j's in such a way to ensure that the process doesn't blow up. In any event, ARMA(pp, qq) processes are used in a variety of forecasting and modeling applications.
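As a sketch, here is one way to generate an ARMA(pp,qq) path in Python with Nor(0,1) noise. The function name and the zero-start convention are ours; starting from zero initial conditions means the first few observations are not yet stationary:

```python
import random

def arma(phis, thetas, n, rng=random):
    """Generate n observations from an ARMA(p, q) process:
    Y_i = sum_j phis[j] * Y_{i-j-1} + eps_i + sum_j thetas[j] * eps_{i-j-1}.
    Pre-sample Y's and eps's are taken to be zero (a warm-up convention)."""
    p, q = len(phis), len(thetas)
    ys, eps = [0.0] * p, [0.0] * q     # placeholder pre-sample values
    out = []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        ar = sum(phis[j] * ys[-(j + 1)] for j in range(p))
        ma = sum(thetas[j] * eps[-(j + 1)] for j in range(q))
        y = ar + e + ma
        out.append(y)
        ys.append(y)
        eps.append(e)
    return out
```

Whether the output stays bounded depends on the coefficient choices, which is exactly the stability caveat above.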

Exponential AR Process

An exponential autoregressive process, or EAR, is an autoregressive process that has exponential, not normal, noise. Here's the definition:

Yi={ϕYi1,w.p. ϕϕYi1+ϵi,w.p. 1ϕi=1,2,...,Y_i = \begin{cases} \phi Y_{i-1}, & \text{w.p. } \phi \\ \phi Y_{i-1} + \epsilon_i, & \text{w.p. } 1 - \phi \\ \end{cases} \quad i = 1,2,...,

Here we need to ensure that 0ϕ<10 \leq \phi < 1, Y0Exp(1)Y_0 \sim \text{Exp}(1), and the ϵi\epsilon_i's are iid Exp(1) random variables.

Believe it or not, the EAR(1) has the same covariance function as the AR(1), except that the bounds of ϕ\phi are different:

Cov(Yi,Yi+k)=ϕk,0ϕ<1\text{Cov}(Y_i, Y_{i+k}) = \phi^{|k|}, 0 \leq \phi < 1

Let's look at a plot of an EAR(1) process with ϕ=0\phi = 0. These observations are iid exponential, and if we were to make a histogram of these observations, it would resemble the exponential distribution.

Here's a plot of an EAR(1) process with ϕ=0.95\phi = 0.95. We can see that when successive observations decrease, they appear to decrease exponentially. With probability 1ϕ=0.051 - \phi = 0.05, they receive a jump from ϵi\epsilon_i, which causes the graph to shoot up.

Autoregressive Pareto - ARP

A random variable XX has the Pareto distribution with parameters λ>0\lambda > 0 and β>0\beta > 0 if it has the cdf:

FX(x)=1(λ/x)β,xλF_X(x) = 1 - (\lambda/x)^\beta, x \geq \lambda

This distribution has a very fat tail. In other words, its pdf approaches zero (its cdf approaches one) much more slowly than the normal distribution.

To obtain the ARP process, let's first start off with a regular AR(1) process with normal noise:

Yi=ρYi1+ϵi,i=1,2,...,Y_i = \rho Y_{i-1} + \epsilon_i, \quad i = 1,2,...,

Recall that we need to ensure that −1<ρ<1-1 < \rho < 1, Y0∼Nor(0,1)Y_0 \sim \text{Nor}(0,1), and that the ϵi\epsilon_i's are iid Nor(0,1−ρ2)\text{Nor}(0, 1 - \rho^2). Note that the YiY_i's are marginally Nor(0,1), but they are correlated on the order of ρ∣k∣\rho^{|k|}.

Since the YiY_i's are standard normal random variables, we can plug them into their own cdf, Φ()\Phi(\cdot) to obtain Unif(0,1) random variables: Ui=Φ(Yi),i=1,2,...,U_i = \Phi(Y_i), i = 1,2,...,. Since the YiY_i's are correlated, so are the UiU_i's.

Now, we can feed the correlated UiU_i's into the inverse of the Pareto cdf to obtain correlated Pareto observations:

Xi=FX1(Ui)=FX1(Φ(Yi))=λ[1Φ(Yi)]1/β,i=1,2,...,X_i = F_X^{-1}(U_i) = F_X^{-1}(\Phi(Y_i)) = \frac{\lambda}{[1 - \Phi(Y_i)]^{1/\beta}}, \quad i = 1,2,...,
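The whole pipeline — correlated normals, to correlated uniforms, to correlated Paretos — fits in a short Python sketch (function names ours):

```python
import math
import random

def normal_cdf(x):
    """Standard normal cdf Phi, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def arp(rho, lam, beta, n, rng=random):
    """Autoregressive Pareto: run an AR(1) with Nor(0,1) marginals,
    push each Y_i through Phi to get correlated Unif(0,1)'s, then
    invert the Pareto(lam, beta) cdf."""
    y = rng.gauss(0.0, 1.0)                 # Y_0
    sd = math.sqrt(1.0 - rho * rho)
    xs = []
    for _ in range(n):
        y = rho * y + rng.gauss(0.0, sd)
        u = normal_cdf(y)                   # U_i = Phi(Y_i)
        xs.append(lam / (1.0 - u) ** (1.0 / beta))  # inverse Pareto cdf
    return xs
```

Every observation lands at or above λ\lambda, as the Pareto support requires, and neighboring observations inherit the AR(1) correlation.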

DEMO

In this demo, we will implement an EAR(1) process in Matlab. Consider the following code:

phi = 0.0;
m = 100;
X = zeros(1, m);                % preallocate
X(1) = exprnd(1,1);             % one Exp(1) observation for X(1)
for i=2:m
  % add Exp(1) noise with probability 1-phi
  X(i) = phi*X(i-1) + (exprnd(1,1) * binornd(1,1-phi));
end
Y = [1:m];
plot(Y, X)

We start out with phi equal to zero, so we are generating simple exponential observations. We initialize X(1) to exprnd(1,1), which is one Exp(1) observation. Then we iterate.

On each iteration, we set subsequent X(i) entries to phi*X(i-1) + (exprnd(1,1) * binornd(1,1-phi)). Notice the second term in this expression. This term adds exponential noise with probability 1-phi. The binornd function returns zero with probability phi, zeroing out the noise, and returns one with probability 1-phi.

Let's see the plot.

Now let's increase phi to 0.95 and look at the plot again. As we saw previously, we see what looks like exponential decay punctuated by random jumps.

Queueing (OPTIONAL)

In this lesson, we will discuss an easy way to generate random variables associated with queueing processes.

M/M/1

Let's consider a single-server queue with customers arriving according to a Poisson(λ\lambda) process. As we know, the interarrival times are iid Exp(λ\lambda), and the service times are iid Exp(μ\mu). Customers potentially wait in a FIFO queue upon arrival, depending on the state of the server. In general, we require that μ>λ\mu > \lambda; if customers don't get served faster than they arrive, the queue grows unbounded.

In terms of notation, we let Ii+1I_{i+1} denote the interarrival time between the iith and (i+1)(i+1)st customers. We let SiS_i be the iith customer's service time, and we let WiQW^Q_i denote the iith customer's wait before service.

Lindley Equations

Lindley gives a simple equation for generating a series of waiting times where we don't need to worry about the exponential assumptions. Here's how we generate the (i+1)(i+1)st queue time:

Wi+1Q=max{WiQ+SiIi+1,0}W^Q_{i+1} = \max\{W^Q_{i} + S_{i} - I_{i+1}, 0\}

This expression makes sense. If the iith customer waited a long time, we expect the (i+1)(i+1)st customer to wait a long time. Similarly, if the iith customer has a long service time, we expect the (i+1)(i+1)st customer to wait a long time. However, if the (i+1)(i+1)st interarrival time is long, perhaps the system had time to clear out, and the (i+1)(i+1) customer might not have to wait so long. Of course, wait times cannot be negative.

We can express the total time in the system for customer ii as Wi=WiQ+SiW_i = W_i^Q + S_i. Here's how we generate Wi+1W_{i+1}:

Wi+1=max{WiIi+1,0}+Si+1W_{i+1} = \max\{W_i - I_{i+1}, 0\} + S_{i+1}

Customer i+1i+1 has to wait until customer ii clears out of the system, which occurs in WiIi+1W_i - I_{i+1} time. Then, customer i+1i+1 must remain in the system for their service time, Si+1S_{i+1}.
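The Lindley recursion for waiting times works for any interarrival and service distributions. Here is a Python sketch (names ours) that takes arbitrary samplers and, as an example, plugs in exponentials to mimic an M/M/1:

```python
import random

def lindley_waits(n, interarrival, service, rng=random):
    """Generate n waiting times in queue via Lindley's recursion:
    W^Q_{i+1} = max(W^Q_i + S_i - I_{i+1}, 0).
    interarrival() and service() each draw one observation; any
    distributions work, not just exponentials."""
    waits = [0.0]                       # the first customer never waits
    for _ in range(n - 1):
        w = waits[-1] + service() - interarrival()
        waits.append(max(w, 0.0))
    return waits

# example: M/M/1 with lambda = 0.5, mu = 1.0,
# where steady-state E[W^Q] = lambda / (mu * (mu - lambda)) = 1
waits = lindley_waits(10000,
                      lambda: random.expovariate(0.5),   # Exp(lambda)
                      lambda: random.expovariate(1.0))   # Exp(mu)
```

Because the samplers are passed in as functions, swapping in, say, uniform service times requires no change to the recursion itself.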

Brownian Motion

In this lesson, we'll look at generating Brownian motion, as well as a few applications. Brownian motion is probably the most important stochastic process out there.

Brownian Motion

Robert Brown discovered Brownian motion when he looked at pollen under a microscope and noticed the grains moving around randomly. Einstein analyzed the process rigorously, giving it a physical formulation in one of his celebrated 1905 papers. Norbert Wiener established its mathematical rigor, which is why Brownian motion is sometimes called a Wiener process.

Brownian motion is used everywhere, from financial analysis to queueing theory to chaos theory to statistics to many other operations research and industrial engineering domains.

The stochastic process, {W(t),t0}\{\mathcal{W}(t), t \geq 0 \}, is standard Brownian motion if:

  • W(0)=0\mathcal{W}(0) = 0
  • W(t)Nor(0,t)\mathcal{W}(t) \sim \text{Nor}(0,t)
  • {W(t),t0}\{\mathcal{W}(t), t \geq 0 \} has stationary and independent increments.

Let's talk about the first two points. A standard Brownian motion process always initializes at zero, and the distribution of the process, evaluated at any time tt, is Nor(0, tt), so the variance of the process grows linearly with time.

An increment describes how much the process changes between two points. For example, W(b)W(a)\mathcal{W}(b) - \mathcal{W}(a) describes the increment from time aa to time bb. If a process has stationary increments, then the distribution of how much the process changes over an interval from tt to t+ht + h only depends on the length of the interval, hh.

We can recall our discussion of stationary increments from when we talked about arrivals. For example, if the number of customers arriving at a restaurant between 2 a.m and 5 a.m has the same distribution as the number of customers arriving between 12 p.m and 3 p.m, we would say that the customer arrival process has stationary increments.

Now let's talk about independent increments. A process has independent increments if, for a<b<c<da < b < c < d, W(d)W(c)\mathcal{W}(d) - \mathcal{W}(c) is independent of W(b)W(a)\mathcal{W}(b) - \mathcal{W}(a).

How do we get Brownian motion? Let's let Y1,Y2,...,YnY_1, Y_2,...,Y_n be any sequence of iid random variables with mean zero and variance one. Donsker's Central Limit Theorem says that:

1ni=1ntYidW(t), as n\frac{1}{\sqrt{n}} \sum_{i=1}^{\lfloor nt \rfloor} Y_i \overset{\text d}{\to} \mathcal{W}(t), \text{ as } n \to \infty

Remember that d\overset{\text d}{\to} denotes convergence in distribution as nn gets big and \lfloor \cdot \rfloor denotes the floor or "round down" function: 3.7=3\lfloor 3.7 \rfloor = 3.

When we learned the standard central limit theorem, tt was equal to one, and the YiY_i's converged to a Nor(0,1) random variable. Note that W(1)Nor(0,1)\mathcal{W}(1) \sim \text{Nor}(0,1). As we can see, Donsker's central limit theorem is a generalization of the standard central limit theorem, as it works for all tt. Not only that, but this central limit theorem also mimics the correlation structure of all the W\mathcal{W}'s. Instead of converging to a single random variable, this sum converges to an entire stochastic process for arbitrary tt.

Let's look at an easy way to construct Brownian motion. To obtain YiY_i's that have mean zero and variance one, we can take a random walk: let Yi=±1Y_i = \pm 1, each with probability 1/21/2, and take nn of at least 100. Then, let t=1/n,2/n,…,n/nt=1/n, 2/n, \ldots, n/n and calculate W(1/n),W(2/n),…,W(n/n)\mathcal{W} (1/n), \mathcal{W} (2/n), \ldots, \mathcal{W} (n/n) according to the summation above. Another choice is to simply sample Yi∼Nor(0,1)Y_i \sim \text{Nor}(0,1).

Let's construct some Brownian motion. First, we pick some "large" value of nn and start with W(0)=0\mathcal{W}(0)=0. Then:

W(in)=W(i1n)+Yin\mathcal{W}\left(\frac{i}{n}\right) = \mathcal{W}\left(\frac{i - 1}{n}\right) + \frac{Y_i}{\sqrt{n}}
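The construction above can be sketched directly in Python (function name ours), using the Nor(0,1) choice for the YiY_i's:

```python
import math
import random

def brownian_path(n, rng=random):
    """Approximate standard Brownian motion on [0,1] at t = 0, 1/n, ..., 1
    via W(i/n) = W((i-1)/n) + Y_i / sqrt(n), with Y_i ~ Nor(0,1)."""
    w = [0.0]                           # W(0) = 0
    for _ in range(n):
        w.append(w[-1] + rng.gauss(0.0, 1.0) / math.sqrt(n))
    return w
```

By construction the endpoint of each path is a sum of nn iid Nor(0, 1/n1/n) increments, so it is distributed Nor(0,1), matching W(1)\mathcal{W}(1).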

Miscellaneous Properties of Brownian Motion

As it turns out, Brownian motion is continuous everywhere but is differentiable nowhere.

The covariance between two points in a Brownian motion is the minimum of the two times: Cov(W(s),W(t))=min(s,t)\text{Cov}(\mathcal{W}(s), \mathcal{W}(t)) = \min(s,t). We can use this result to prove that the area under W(t)\mathcal{W}(t) from zero to one is normal:

01W(t)dtNor(0,13)\int_0^1 \mathcal{W}(t)dt \sim \text{Nor}(0, \frac{1}{3})

A Brownian bridge, B(t)\mathcal{B}(t), is conditioned Brownian motion such that W(0)=W(1)=0\mathcal{W}(0) = \mathcal{W}(1) = 0. Brownian bridges are useful in financial analysis. The covariance structure for a Brownian bridge is Cov(B(s),B(t))=min(s,t)st\text{Cov}(\mathcal{B}(s), \mathcal{B}(t)) = \min(s,t) - st. Finally, the area under B(t)\mathcal{B}(t) from zero to one is normal:

01B(t)dtNor(0,112)\int_0^1 \mathcal{B}(t)dt \sim \text{Nor}(0, \frac{1}{12})
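A Brownian bridge is also easy to simulate. One standard construction, not stated in the text but well known, is to generate a Brownian path and tie it down via B(t)=W(t)−tW(1)\mathcal{B}(t) = \mathcal{W}(t) - t\mathcal{W}(1); here is a self-contained Python sketch (names ours):

```python
import math
import random

def brownian_bridge(n, rng=random):
    """Approximate a Brownian bridge on [0,1]: build W from a scaled
    random walk, then pin both ends via B(t) = W(t) - t * W(1)."""
    w = [0.0]
    for _ in range(n):
        w.append(w[-1] + rng.gauss(0.0, 1.0) / math.sqrt(n))
    return [w[i] - (i / n) * w[n] for i in range(n + 1)]
```

The transformation forces B(0)=B(1)=0\mathcal{B}(0) = \mathcal{B}(1) = 0 exactly, while preserving the Gaussian fluctuations in between.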

Geometric Brownian Motion

Now let's talk about geometric Brownian motion, which is particularly useful in financial analysis. We can model a stock price, for example, with the following process:

S(t)=S(0)exp{(μσ22)t+σW(t)},t0S(t) = S(0)\exp\left\{\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma\mathcal{W}(t)\right\}, \quad t \geq 0

Let's unpack some of these terms. Of course, W(t)\mathcal{W}(t) is the Brownian motion, which provides the randomness in the stock price. The term σ\sigma refers to the stock's volatility, and μ\mu is related to the "drift" of the stock's price. Long story short: don't buy a stock unless it is drifting upward. The quantity μσ2/2\mu - \sigma^2 / 2 relates the stock's drift to its volatility, and we want this quantity to be positive. Finally, S(0)S(0) is the initial price.

Additionally, we can use a geometric Brownian motion to estimate option prices. The simplest type of option is a European call option, which permits the owner to purchase the stock at a pre-agreed strike price, kk, at a pre-determined expiry date, TT. The owner pays an up-front fee for the privilege to exercise this option.

Let's look at an example. Say that we believe that IBM will sell for $120 per share in a few months. Currently, it's selling for $100 per share. It would be nice to buy 1000 shares of IBM at $100 per share today, but perhaps we don't have $100,000 on hand to make that purchase. Instead, we can buy an option today, for maybe $1.50, that allows us to buy IBM at $105 per share in a few months. If IBM does go to $120 per share, we can exercise our option, instantaneously buying IBM for $105 and selling for $120, netting a $13.50 profit ($15 minus the option price, $1.50).

If the stock drops to, say, $95 per share, then we would choose simply not to exercise the option. Obviously, it doesn't make sense to buy IBM for $105 per share using the option when we can just go to the market and buy it for less.

The question then becomes: what is a fair price to pay for an option? The value, VV, of an option is:

V=erTE[(S(T)k)+]V = e^{-rT}E[(S(T) - k)^+]

The expression S(T)−kS(T) - k is the profit we make from exercising the option at time TT at strike price kk. Since we would never exercise at a loss, we only consider the difference when it is positive: (S(T)−k)+=max⁡{0,S(T)−k}(S(T) - k)^+ = \max\{0,S(T) - k\}.

Instead of buying the option, we could have put the money we spent, $1.50, into a bank, where it would earn interest. To account for this opportunity cost, we discount the expected payoff by the factor e−rTe^{-rT}, where rr is the "risk-free" interest rate that the government is currently guaranteeing.

We can calculate VV to determine what the option is worth. For example, if we determine that the option is worth $1.50, and we can buy it for $1.30, then perhaps we stand to make some profit.

Alternatively, we can use the option merely for an insurance policy. Southwest airlines bought options on fuel prices many years ago. The fuel prices went way up, and Southwest made a fortune because they could buy fuel at a much lower price.

To estimate the expected value of the option, we can run multiple simulation replications of W(t)\mathcal{W}(t) and (S(T)k)+(S(T) - k)^+ and then take the sample average of all the ViV_i's. All we need to do is select our favorite values of rr, σ\sigma, TT, and kk, and we are off to the races.

Alternatively, we can just simulate the distribution of S(T)S(T) directly. Since W(t)∼Nor(0,t)\mathcal{W}(t) \sim \text{Nor}(0, t), we can use a lognormal distribution to simulate S(T)S(T) directly. Furthermore, we can look up the answer directly using the Black-Scholes equation.
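Either estimation route takes only a few lines of code. Here is a Python sketch of the Monte Carlo approach (function name and parameter values are ours), simulating S(T)S(T) directly from W(T)∼Nor(0,T)\mathcal{W}(T) \sim \text{Nor}(0, T) with the risk-free rate rr as the drift:

```python
import math
import random

def mc_call_value(s0, k, r, sigma, t, reps, rng=random):
    """Monte Carlo estimate of V = e^{-rT} E[(S(T) - k)^+] under geometric
    Brownian motion with drift r, using W(T) ~ Nor(0, T)."""
    disc = math.exp(-r * t)
    total = 0.0
    for _ in range(reps):
        w = rng.gauss(0.0, math.sqrt(t))                       # W(T)
        s_t = s0 * math.exp((r - 0.5 * sigma * sigma) * t + sigma * w)
        total += max(s_t - k, 0.0)                             # (S(T) - k)^+
    return disc * total / reps

# example parameters (ours): S(0)=100, k=105, r=5%, sigma=20%, T=0.25 years
v = mc_call_value(100.0, 105.0, 0.05, 0.2, 0.25, 100000)
```

With enough replications, the estimate settles near the corresponding Black-Scholes value.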

How to Win a Nobel Prize

Let ϕ()\phi(\cdot) and Φ()\Phi(\cdot) denote the Nor(0,1) pdf and cdf respectively. Moreover, let's define a constant, bb:

brTσ2T2ln(k/S(0))σTb \equiv \frac{rT - \frac{\sigma^2 T}{2} - \ln(k / S(0))}{\sigma\sqrt T}

The Black-Scholes European call option value is:

\begin{alignedat}{1} & e^{-rT}E[S(T) - k]^+ \\ & = e^{-rT}E\left[S(0)\exp\left\{\left(\mu - \frac{\sigma^2}{2}\right)T + \sigma\mathcal{W}(T)\right\} - k\right]^+ \\ & = e^{-rT}\int_{-\infty}^\infty \left[S(0)\exp\left\{\left(r - \frac{\sigma^2}{2}\right)T + \sigma\sqrt{T}\,z \right\} - k\right]^+ \phi(z)\,dz \\ &= S(0) \Phi(b+\sigma\sqrt{T}) - ke^{-rT}\Phi(b) \end{alignedat}
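The closed form is simple enough to evaluate directly. Here is a Python sketch (function names ours) of the constant bb and the resulting call value, using the error function for Φ\Phi:

```python
import math

def normal_cdf(x):
    """Standard normal cdf Phi, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(s0, k, r, sigma, t):
    """Black-Scholes European call value:
    b = (rT - sigma^2 T / 2 - ln(k / s0)) / (sigma * sqrt(T)),
    value = s0 * Phi(b + sigma * sqrt(T)) - k * e^{-rT} * Phi(b)."""
    b = (r * t - 0.5 * sigma * sigma * t - math.log(k / s0)) / (sigma * math.sqrt(t))
    return s0 * normal_cdf(b + sigma * math.sqrt(t)) - k * math.exp(-r * t) * normal_cdf(b)
```

As a sanity check, an at-the-money call with r=0r = 0, σ=0.2\sigma = 0.2, T=1T = 1 is worth about 8% of the stock price, and the value increases with volatility.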

DEMO

In this demo, we will generate some Brownian motion. Here's ten steps of Brownian motion in one dimension.

Here's another sample path.

Let's take 1000 steps.

Let's generate ten steps of two-dimensional Brownian motion.

Here's 1000 steps.

Here's 10000 steps.

Finally, let's look at Brownian motion in three dimensions.

OMSCS Notes is made with ❤️ in NYC by Matt Schlenker.

Copyright © 2019-2023. All rights reserved.
