Simulation


Input Analysis

Introduction

The goal of input analysis is to ensure that the random variables we use in our simulations adequately approximate the real-world phenomena we are modeling. We might use random variables to model interarrival times, service times, breakdown times, or maintenance times, among other things. These variables don't come out of thin air; we have to specify them appropriately.

Why Worry? GIGO!

If we specify our random variables improperly, we can end up with a GIGO simulation: garbage-in-garbage-out. This situation can ruin the entire model and invalidate any results we might obtain.

GIGO Example

Let's consider a single-server queueing system with constant service times of ten minutes. Let's incorrectly assume that we have constant interarrival times of twelve minutes. We should expect to never have a line under this assumption because the service times are always shorter than the interarrival times.

However, suppose that, in reality, the interarrival times are exponential with a mean of twelve minutes. In this case, the simulation never sees the line that actually occurs. In fact, the line might get quite long, and the simulation has no way of surfacing that fact.

So What To Do?

Here's the high-level game plan for performing input analysis. First, we'll collect some data for analysis for any random variable of interest. Next, we'll determine, or at least estimate, the underlying distribution along with associated parameters: for example, Nor(30,8). We will then conduct a formal statistical test to see if the distribution we chose is "approximately" correct. If we fail the test, then our guess is wrong, and we have to go back to the drawing board.

Identifying Distributions

In this lesson, we will look at some high-level methods for examining data and guessing the underlying distribution.

Three Little Bears

We can always present data in the form of a histogram. Suppose we collect one hundred observations and plot a histogram with only a handful of cells. We can't really determine the underlying distribution because we don't have enough cells.

Suppose instead that we plot a histogram with far too many cells. The resolution on this plot is too granular. We can potentially see the underlying distribution, but we risk missing the forest for the trees.

If we plot a histogram with a moderate number of cells, we get a much clearer picture of the underlying distribution.

It turns out that, if we take enough observations, the histogram will eventually converge to the true pdf/pmf of the random variable we are trying to model, according to the Glivenko–Cantelli theorem.
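
Here is a minimal sketch (not from the lecture) of this "three bears" effect using NumPy and Matplotlib; the Exp(1) data and the particular bin counts are assumptions chosen purely for illustration.

```python
# Plot the same 100 observations with too few, too many, and a moderate
# number of histogram cells to see how the choice affects the picture.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
data = rng.exponential(scale=1.0, size=100)   # 100 assumed Exp(1) observations

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins, label in zip(axes, [3, 60, 12], ["too few", "too many", "about right"]):
    ax.hist(data, bins=bins, edgecolor="black")
    ax.set_title(f"{bins} cells ({label})")
plt.tight_layout()
plt.show()
```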

Stem-and-Leaf

If we turn the histogram on its side and add some numerical information, we get a stem-and-leaf diagram, where each stem represents the common root shared among a collection of observations, and each leaf represents the observation itself.

Which Distribution?

When looking at empirical data, what questions might we ask to arrive at the underlying distribution? For example, can we at least tell if the observations are discrete or continuous?

We might want to ask whether the distribution is univariate or multivariate. We might be interested in someone's weight, but perhaps we need to generate height and weight observations simultaneously.

Additionally, we might need to check how much data we have available. Certain distributions lend themselves more easily to smaller samples of data.

Furthermore, we might need to communicate with experts regarding the nature of the data. For example, we might want to know if the arrival rate changes at our facility as the day progresses. While we might observe the rate directly, we might want to ask the floor supervisor what to expect beforehand.

Finally, what happens when we don't have much or any data? What if the system we want to model doesn't exist yet? How might we guess a good distribution?

Which Distribution, II?

Let's suppose we know that we have a discrete distribution. For example, we might realize that we only see a finite number of observations during our data collection process. How do we determine which discrete distribution to use?

If we want to model successes and failures, we might use a Bernoulli random variable and estimate p. If we want to look at the number of successes in n trials, we need to consider using a binomial random variable.

Perhaps we want to understand how many trials we need until we get our first success. In that case, we need to look at a geometric random variable. Alternatively, if we want to know how many trials we need until the nth success, we need a negative binomial random variable.

We can use the Poisson(λ\lambda) distribution to count the number of arrivals over time, assuming that the arrival process satisfies certain elementary assumptions.

If we honestly don't have a good model for the discrete data, perhaps we can use an empirical or sample distribution.

Which Distribution, III?

What if the distribution is continuous?

We might consider the uniform distribution if all we know about the data is the minimum and maximum possible values. If we know the most likely value as well, we might use the triangular distribution.

If we are looking at interarrival times from a Poisson process, then we know we should be looking at the Exp(λ\lambda) distribution. If the process is nonhomogeneous, we might have to do more work, but the exponential distribution is a good starting point.

We might consider the normal distribution if we are looking at heights, weights, or IQs. Furthermore, if we are looking at sample means or sums, the normal distribution is a good choice because of the central limit theorem.

We can use the Beta distribution, which generalizes the uniform distribution, to specify bounded data. We might use the gamma, Weibull, Gumbel, or lognormal distribution if we are dealing with reliability data.

When in doubt, we can use the empirical distribution, which is based solely on the sample.

Game Plan

As we said, we will choose a "reasonable" distribution, and then we'll perform a hypothesis test to make sure that our choice is not too ridiculous.

For example, suppose we hypothesize that some data is normal. This data should fall approximately on a straight line when we graph it on a normal probability plot, and it should look normal when we graph it on a standard plot. At the very least, it should also pass a goodness-of-fit test for normality, of which there are several.

Unbiased Point Estimation

It's not enough to decide that some sequence of observations is normal; we still have to estimate μ\mu and σ2\sigma^2. In the next few lessons, we will look at point estimation, which lets us understand how to estimate these unknown parameters. We'll cover the concept of unbiased estimation first.

Statistic Definition

A statistic is a function of the observations X1,...,XnX_1,...,X_n that is not explicitly dependent on any unknown parameters. For example, the sample mean, Xˉ\bar{X}, and the sample variance, S2S^2, are two statistics:

\begin{alignedat}{1} \bar{X} = &\frac{1}{n} \sum_{i=1}^n X_i \\[2ex] S^2 = &\frac{1}{n-1} \sum_{i=1}^n(X_i - \bar{X})^2 \end{alignedat}

Statistics are random variables. In other words, if we take two different samples, we should expect to see two different values for a given statistic.

We usually use statistics to estimate some unknown parameter from the underlying probability distribution of the XiX_i's. For instance, we use the sample mean, Xˉ\bar{X}, to estimate the true mean, μ\mu, of the underlying distribution, which we won't normally know. If μ\mu is the true mean, then we can take a bunch of samples and use Xˉ\bar{X} to estimate μ\mu. We know, via the law of large numbers that, as nn \to \infty, Xˉμ\bar{X} \to \mu.
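
As a small illustration (the Nor(30, 8^2) data are an assumption, not part of the notes), here is how we might compute these two statistics with NumPy and watch \bar{X} settle toward \mu as n grows:

```python
# Compute X-bar and S^2 on samples of increasing size; X-bar approaches mu
# by the law of large numbers.
import numpy as np

rng = np.random.default_rng(seed=1)

for n in (10, 100, 10_000):
    x = rng.normal(loc=30.0, scale=8.0, size=n)   # assumed Nor(30, 64) data
    x_bar = x.mean()                              # sample mean
    s2 = x.var(ddof=1)                            # sample variance (n-1 divisor)
    print(n, round(x_bar, 3), round(s2, 3))       # X-bar -> 30, S^2 -> 64
```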

Point Estimator

Let's suppose that we have a collection of iid random variables, X1,...,XnX_1,...,X_n. Let T(X)T(X1,...,Xn)T(\bold{X}) \equiv T(X_1,...,X_n) be a function that we can compute based only on the observations. Therefore, T(X)T(\bold{X}) is a statistic. If we use T(X)T(\bold{X}) to estimate some unknown parameter θ\theta, then T(X)T(\bold{X}) is known as a point estimator for θ\theta.

For example, Xˉ\bar{X} is usually a point estimator for the true mean, μ=E[Xi]\mu = E[X_i], and S2S^2 is often a point estimator for the true variance, σ2=Var(X)\sigma^2 = \text{Var}(X).

T(X)T(\bold{X}) should have specific properties:

  • Its expected value should equal the parameter it's trying to estimate. This property is known as unbiasedness.
  • It should have a low variance. It doesn't do us any good if T(X)T(\bold{X}) is bouncing around depending on the sample we take.

Unbiasedness

We say that T(X)T(\bold{X}) is unbiased for θ\theta if E[T(X)]=θE[T(\bold{X})] = \theta. For example, suppose that random variables, X1,...,XnX_1,...,X_n are iid anything with mean μ\mu. Then:

E[Xˉ]=E[1ni=1nXi]=1ni=1nE[Xi]=nE[Xi]n=E[Xi]=μ\begin{alignedat}{1} E[\bar{X}] & = E\left[\frac{1}{n}\sum_{i=1}^nX_i\right] \\[3ex] & = \frac{1}{n}\sum_{i=1}^nE[X_i] \\[3ex] & = \frac{nE[X_i]}{n} \\[2ex] & = E[X_i] = \mu \end{alignedat}

Since E[Xˉ]=μE[\bar{X}] = \mu, Xˉ\bar{X} is always unbiased for μ\mu. That's why we call it the sample mean.

Similarly, suppose we have random variables, X1,...,XnX_1,...,X_n which are iid Exp(λ)\text{Exp}(\lambda). Then, Xˉ\bar{X} is unbiased for μ=E[Xi]=1/λ\mu = E[X_i] = 1 / \lambda. Even though λ\lambda is unknown, we know that Xˉ\bar{X} is a good estimator for 1/λ1/ \lambda.

Be careful, though. Just because Xˉ\bar{X} is unbiased for 1/λ1 / \lambda does not mean that 1/Xˉ1 / \bar{X} is unbiased for λ\lambda: E[1/Xˉ]1/E[Xˉ]=λE[1/\bar{X}] \neq 1 /E[\bar{X}] = \lambda. In fact, 1/Xˉ1/\bar{X} is biased for λ\lambda in this exponential case.
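
A quick Monte Carlo sketch makes this bias concrete; the rate \lambda = 2, the sample size n = 10, and the replication count are all assumptions chosen for illustration:

```python
# Estimate E[X-bar] and E[1/X-bar] by simulation for Exp(lambda) data.
import numpy as np

rng = np.random.default_rng(seed=2)
lam, n, reps = 2.0, 10, 100_000

samples = rng.exponential(scale=1.0 / lam, size=(reps, n))
x_bar = samples.mean(axis=1)

print(x_bar.mean())          # near 1/lambda = 0.5, so X-bar is unbiased for 1/lambda
print((1.0 / x_bar).mean())  # near n*lambda/(n-1) = 2.22, not 2.0, so 1/X-bar is biased
```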

Here's another example. Suppose that random variables, X1,...,XnX_1,...,X_n are iid anything with mean μ\mu and variance σ2\sigma^2. Then:

E[S2]=E[i=1n(XiXˉ)2n1]=Var(Xi)=σ2E[S^2] = E\left[\frac{\sum_{i=1}^n(X_i - \bar{X})^2}{n - 1}\right] = \text{Var}(X_i) = \sigma^2

Since E[S2]=σ2E[S^2] = \sigma^2, S2S^2 is always unbiased for σ2\sigma^2. That's why we called it the sample variance.

For example, suppose random variables X1,...,XnX_1,...,X_n are iid Exp(λ)\text{Exp}(\lambda). Then S2S^2 is unbiased for Var(Xi)=1/λ2\text{Var}(X_i) = 1 / \lambda^2.

Let's give a proof for the unbiasedness of S2S^2 as an estimate for σ2\sigma^2. First, let's convert S2S^2 into a better form:

\begin{alignedat}{1} S^2 & = \frac{\sum_{i=1}^n(X_i - \bar{X})^2}{n - 1} \\[3ex] & = \frac{\sum_{i=1}^n \left(X_i^2 -2X_i\bar{X} + \bar{X}^2\right)}{n - 1} \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - \sum_{i=1}^n 2X_i\bar{X} + \sum_{i=1}^n \bar{X}^2\right) \end{alignedat}

Let's rearrange the middle sum:

i=1n2XiXˉ=2Xˉi=1nXi \sum_{i=1}^n 2X_i\bar{X} = 2\bar{X}\sum_{i=1}^n X_i

Remember that Xˉ\bar{X} represents the average of all the XiX_i's: Xi/n\sum X_i / n. Thus, if we just sum the XiX_i's and don't divide by nn, we have a quantity equal to nXˉn\bar{X}:

2Xˉi=1nXi=2XˉnXˉ=2nXˉ22\bar{X}\sum_{i=1}^n X_i = 2\bar{X}n\bar{X} = 2n\bar{X}^2

Now, back to action:

S2=1n1(i=1nXi2i=1n2XiXˉ+i=1nXˉ2)=1n1(i=1nXi22nXˉ2+nXˉ2)=1n1(i=1nXi2nXˉ2)=i=1nXi2nXˉ2n1\begin{alignedat}{1} S^2 & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - \sum_{i=1}^n 2X_i\bar{X} + \sum_{i=1}^n \bar{X}^2\right) \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - 2n\bar{X}^2 + n\bar{X}^2\right) \\[3ex] & = \frac{1}{n-1}\left(\sum_{i=1}^n X_i^2 - n\bar{X}^2 \right) \\[3ex] & = \frac{\sum_{i=1}^n X_i^2 - n\bar{X}^2}{n-1} \\[3ex] \end{alignedat}

Let's take the expected value:

E[S2]=i=1nE[Xi2]nE[Xˉ2]n1\begin{alignedat}{1} E[S^2] & = \frac{\sum_{i=1}^n E[X_i^2] - nE[\bar{X}^2]}{n-1} \end{alignedat}

Note that E[Xi2]E[X_i^2] is the same for all XiX_i, so the sum is just nE[X12]nE[X_1^2]:

E[S2]=nE[X12]nE[Xˉ2]n1=nn1(E[X12]E[Xˉ2])\begin{alignedat}{1} E[S^2] & = \frac{n E[X_1^2] - nE[\bar{X}^2]}{n-1} \\[2ex] & = \frac{n}{n-1} \left(E[X_1^2] - E[\bar{X}^2]\right) \end{alignedat}

We know that Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2, so E[X2]=Var(X)+(E[X])2E[X^2] = \text{Var}(X) + (E[X])^2. Therefore:

\begin{alignedat}{1} E[S^2] & = \frac{n}{n-1} \left(\text{Var}(X_1) + (E[X_1])^2 - \text{Var}(\bar{X}) - (E[\bar{X}])^2\right) \end{alignedat}

Remember that E[X1]=E[Xˉ]E[X_1] = E[\bar{X}], so:

E[S2]=nn1(Var(X1)Var(Xˉ))\begin{alignedat}{1} E[S^2] & = \frac{n}{n-1} \left(\text{Var}(X_1) - \text{Var}(\bar{X}) \right) \\[3ex] \end{alignedat}

Furthermore, remember that \text{Var}(\bar{X}) = \text{Var}(X_1) / n = \sigma^2 / n. Therefore:

\begin{alignedat}{1} E[S^2] & = \frac{n}{n-1} \left(\sigma^2 - \sigma^2/n \right) \\[3ex] & = \frac{n\sigma^2}{n-1} - \frac{\sigma^2}{n-1} \\[3ex] & = \frac{n\sigma^2- \sigma^2}{n-1} = \frac{\sigma^2(n-1)}{n-1} = \sigma^2 \end{alignedat}

Unfortunately, while S2S^2 is unbiased for the variance σ2\sigma^2, SS is biased for the standard deviation σ\sigma.
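
The same kind of quick simulation shows the bias of S as an estimator of \sigma; the normal data and the choices \sigma = 8, n = 5 are assumptions for illustration:

```python
# Estimate E[S^2] and E[S] by simulation; S^2 is unbiased, S is biased low.
import numpy as np

rng = np.random.default_rng(seed=3)
sigma, n, reps = 8.0, 5, 200_000

x = rng.normal(loc=0.0, scale=sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)

print(s2.mean())            # close to sigma^2 = 64
print(np.sqrt(s2).mean())   # noticeably below sigma = 8
```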

Mean Squared Error

In this lesson, we'll look at mean squared error, a performance measure that evaluates an estimator by combining its bias and variance.

Bias and Variance

We want to choose an estimator with the following properties:

  • Low bias (defined as the difference between the estimator's expected value and the true parameter value)
  • Low variance

Furthermore, we want the estimator to have both of these properties simultaneously. If the estimator has low bias but high variance, then its estimates are meaninglessly noisy. Its average estimate is correct, but any individual estimate may be way off the mark. On the other hand, an estimator with low variance but high bias is very confident about the wrong answer.

Example

Suppose that we have nn random variables, X1,...,XniidUnif(0,θ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Unif}(0,\theta). We know that our observations have a lower bound of 00, but we don't know the value of the upper bound, θ\theta. As is often the case, we sample many observations from the distribution and use that sample to estimate the unknown parameter.

Consider two estimators for θ\theta:

\begin{alignedat}{1} Y_1 &\equiv 2\bar{X} \\[2ex] Y_2 &\equiv \frac{n+1}{n} \max_{1 \leq i \leq n}X_i \end{alignedat}

Let's look at the first estimator. We know that E[Y1]=2E[Xˉ]E[Y_1] = 2E[\bar{X}], by definition. Similarly, we know that 2E[Xˉ]=2E[Xi]2E[\bar{X}] = 2E[X_i], since Xˉ\bar{X} is always unbiased for the mean. Recall how we compute the expected value for a uniform random variable:

E[A] = (a + b) / 2, \quad A \sim \text{Unif}(a,b)

Therefore:

2E[X_i] = 2\left(\frac{0 + \theta}{2}\right) = \theta = E[Y_1]

As we can see, Y1Y_1 is unbiased for θ\theta.

It's also the case that Y2Y_2 is unbiased, but it takes more work to demonstrate. As a first step, take the cdf of the maximum of the XiX_i's, MmaxiXiM \equiv \max_iX_i. Here's what P(My)P(M \leq y) looks like:

P(My)=P(X1y and X2y and  and Xny)P(M \leq y) = P(X_1 \leq y \text{ and } X_2 \leq y \text{ and } \dots \text{ and } X_n \leq y)

If MyM \leq y, and MM is the maximum, then P(My)P(M \leq y) is the probability that all the XiX_i's are less than yy. Since the XiX_i's are independent, we can take the product of the individual probabilities:

P(M \leq y) = \prod_{i = 1}^n P(X_i \leq y) = [P(X_1 \leq y)]^n

Now, we know, by definition, that the cdf is the integral of the pdf. Remember that the pdf for a uniform distribution, Unif(a,b)\text{Unif}(a,b), is:

f(x)=1ba,a<x<bf(x) = \frac{1}{b-a}, a < x < b

Let's rewrite P(My)P(M \leq y):

P(My)=[0yfX1(x)dx]n=[0y1θdx]n=(yθ)n\begin{alignedat}{1} P(M \leq y) &= \left[\int_0^yf_{X_1}(x)dx\right]^n \\[2ex] & = \left[\int_0^y\frac{1}{\theta}dx\right]^n \\[2ex] & = \left(\frac{y}{\theta}\right)^n \end{alignedat}

Again, we know that the pdf is the derivative of the cdf, so:

\begin{alignedat}{1} f_M(y) & = \frac{d}{dy}F_M(y) \\[2ex] & = \frac{d}{dy} \frac{y^n}{\theta^n} \\[2ex] & = \frac{ny^{n-1}}{\theta^n} \end{alignedat}

With the pdf in hand, we can get the expected value of MM:

E[M]=0θyfM(y)dy=0θnynθndy=nyn+1(n+1)θn0θ=nθn+1(n+1)θn=nθn+1\begin{alignedat}{1} E[M] & = \int_0^\theta yf_M(y)dy \\[2ex] & = \int_0^\theta \frac{ny^n}{\theta^n} dy \\[2ex] & = \frac{ny^{n+1}}{(n+1)\theta^n}\Big|_0^\theta \\[2ex] & = \frac{n\theta^{n+1}}{(n+1)\theta^n} \\[2ex] & = \frac{n\theta}{n+1} \\[2ex] \end{alignedat}

Note that E[M]θE[M] \neq \theta, so MM is not an unbiased estimator for θ\theta. However, remember how we defined Y2Y_2:

Y_2 \equiv \frac{n+1}{n} \max_{1 \leq i \leq n}X_i

Thus:

E[Y2]=n+1nE[M]=n+1n(nθn+1)=θ\begin{alignedat}{1} E[Y_2] & = \frac{n+1}{n}E[M] \\[2ex] & = \frac{n+1}{n} \left(\frac{n\theta}{n+1}\right) \\[2ex] & = \theta \end{alignedat}

Therefore, Y2Y_2 is unbiased for θ\theta.

Both estimators are unbiased, so which is better? Let's compare variances now. After similar algebra, we see:

Var(Y1)=θ23n,Var(Y2)=θ2n(n+2)\text{Var}(Y_1) = \frac{\theta^2}{3n}, \quad \text{Var}(Y_2) = \frac{\theta^2}{n(n+2)}

Since the variance of Y_2 involves dividing by n(n+2), which grows like n^2, while the variance of Y_1 only divides by 3n, Y_2 has a much lower variance than Y_1 and is, therefore, the better estimator.
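
Here's a small simulation sketch of this comparison; \theta = 10, n = 20, and the replication count are assumed values used only for illustration:

```python
# Compare the two unbiased estimators Y1 = 2*X-bar and Y2 = (n+1)/n * max(X_i)
# for theta under Unif(0, theta) data.
import numpy as np

rng = np.random.default_rng(seed=4)
theta, n, reps = 10.0, 20, 100_000

x = rng.uniform(0.0, theta, size=(reps, n))
y1 = 2.0 * x.mean(axis=1)
y2 = (n + 1) / n * x.max(axis=1)

print(y1.mean(), y2.mean())   # both near theta = 10 (unbiased)
print(y1.var(), y2.var())     # near theta^2/(3n) = 1.67 vs theta^2/(n(n+2)) = 0.23
```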

Bias and Mean Squared Error

The bias of an estimator, T(\bold{X}), is the difference between the estimator's expected value and the value of the parameter it's trying to estimate: \text{Bias}(T) \equiv E[T] - \theta. When E[T] = \theta, the bias is 0 and the estimator is unbiased.

The mean squared error of an estimator, T(\bold{X}), is the expected value of the squared deviation of the estimator from the parameter: \text{MSE}(T) \equiv E[(T-\theta)^2].

Remember the equation for variance:

Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2

Using this equation, we can rewrite MSE(T)\text{MSE}(T):

MSE(T)=E[(Tθ)2]=Var(T)+(E[T]θ)2=Var(T)+Bias2\begin{alignedat}{1} \text{MSE}(T) & = E[(T-\theta)^2] \\[2ex] & = \text{Var}(T) + (E[T] - \theta)^2 \\[2ex] & = \text{Var}(T) + \text{Bias}^2 \\[2ex] \end{alignedat}

Usually, we use mean squared error to evaluate estimators. As a result, when selecting between multiple estimators, we might not choose the unbiased estimator, so long as that estimator's MSE is the lowest among the options.

Relative Efficiency

The relative efficiency of one estimator, T1T_1, to another, T2T_2, is the ratio of the mean squared errors: MSE(T1)/MSE(T2)\text{MSE}(T_1) / \text{MSE}(T_2). If the relative efficiency is less than one, we want T1T_1; otherwise, we want T2T_2.

Let's compute the relative efficiency of the two estimators we used in the previous example:

\begin{alignedat}{1} Y_1 &\equiv 2\bar{X} \\[2ex] Y_2 &\equiv \frac{n+1}{n} \max_{1 \leq i \leq n}X_i \end{alignedat}

Remember that both estimators are unbiased, so the bias is zero by definition. As a result, the mean squared errors of the two estimators are determined solely by the variances:

MSE(Y1)=Var(Y1)=θ23n,MSE(Y2)=Var(Y2)=θ2n(n+2)\text{MSE}(Y_1) = \text{Var}(Y_1) = \frac{\theta^2}{3n}, \quad \text{MSE}(Y_2) = \text{Var}(Y_2) = \frac{\theta^2}{n(n+2)}

Let's calculate the relative efficiency:

e(Y1,Y2)=θ23nθ2n(n+2)=θ2n(n+2)3nθ2=n(n+2)3n=n+23\begin{alignedat}{1} e(Y_1, Y_2) & = \frac{\frac{\theta^2}{3n}}{\frac{\theta^2}{n(n+2)}} \\[3ex] & = \frac{\theta^2 * n(n+2)}{3n\theta^2} \\[2ex] & = \frac{n(n+2)}{3n} \\[2ex] & = \frac{n+2}{3} \\[2ex] \end{alignedat}

The relative efficiency is greater than one for all n>1n > 1, so Y2Y_2 is the better estimator just about all the time.

Maximum Likelihood Estimation

In this lesson, we are going to talk about maximum likelihood estimation, which is perhaps the most important point estimation method. It's a very flexible technique that many software packages use to help estimate parameters from various distributions.

Likelihood Function and Maximum Likelihood Estimator

Consider an iid random sample, X1,...,XnX_1,...,X_n, where each XiX_i has pdf/pmf f(x)f(x). Additionally, suppose that θ\theta is some unknown parameter from XiX_i that we would like to estimate. We can define a likelihood function, L(θ)L(\theta) as:

L(θ)i=1nf(xi)L(\theta) \equiv \prod_{i=1}^n f(x_i)

The maximum likelihood estimator (MLE) of θ\theta is the value of θ\theta that maximizes L(θ)L(\theta). The MLE is a function of the XiX_i's and is itself a random variable.

Exponential Example

Consider a random sample, X1,...,XniidExp(λ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda). Find the MLE for λ\lambda. Note that, in this case, λ\lambda, is taking the place of the abstract parameter, θ\theta. Now:

L(λ)i=1nf(xi)L(\lambda) \equiv \prod_{i=1}^n f(x_i)

We know that exponential random variables have the following pdf:

f(x,λ)=λeλxf(x, \lambda) = \lambda e^{-\lambda x}

Therefore:

L(λ)i=1nλeλxiL(\lambda) \equiv \prod_{i=1}^n \lambda e^{-\lambda x_i}

Let's expand the product:

L(λ)=(λeλx1)(λeλx2)(λeλxn)L(\lambda) = (\lambda e^{-\lambda x_1}) * (\lambda e^{-\lambda x_2}) * \dots * (\lambda e^{-\lambda x_n})

We can pull out a λn\lambda^n term:

L(λ)=λn(eλx1)(eλx2)(eλxn)L(\lambda) = \lambda^n * (e^{-\lambda x_1}) * (e^{-\lambda x_2}) * \dots * (e^{-\lambda x_n})

Remember what happens to exponents when we multiply bases:

axay=ax+ya^x * a^y = a^{x+y}

Let's apply this to our product (and we can swap in exp\exp notation to make things easier to read):

L(λ)=λnexp[λi=1nxi]L(\lambda) = \lambda^n \exp\left[-\lambda \sum_{i=1}^nx_i\right]

Now, we need to maximize L(λ)L(\lambda) with respect to λ\lambda. We could take the derivative of L(λ)L(\lambda), but we can use a trick! Since the natural log function is one-to-one, the λ\lambda that maximizes L(λ)L(\lambda) also maximizes ln(L(λ))\ln(L(\lambda)). Let's take the natural log of L(λ)L(\lambda):

ln(L(λ))=ln(λnexp[λi=1nxi])\ln(L(\lambda)) = \ln\left(\lambda^n \exp\left[-\lambda \sum_{i=1}^nx_i\right]\right)

Let's remember three different rules here:

ln(ab)=ln(a)+ln(b),ln(ab)=bln(a),ln(ex)=x\ln(ab) = \ln(a) + \ln(b), \quad \ln(a^b) = b\ln(a), \quad \ln(e^x) = x

Therefore:

ln(L(λ))=ln(λn)+ln(exp[λi=1nxi])=nln(λ)+ln(exp[λi=1nxi])=nln(λ)λi=1nxi\begin{alignedat}{1} \ln(L(\lambda)) & = \ln(\lambda^n) +\ln\left(\exp\left[-\lambda \sum_{i=1}^nx_i\right]\right) \\[2ex] & = n\ln(\lambda) +\ln\left(\exp\left[-\lambda \sum_{i=1}^nx_i\right]\right) \\[2ex] & = n\ln(\lambda) -\lambda \sum_{i=1}^nx_i \\[2ex] \end{alignedat}

Now, let's take the derivative:

ddλln(L(λ))=ddλ(nln(λ)λi=1nxi)=nλi=1nxi\begin{alignedat}{1} \frac{d}{d\lambda}\ln(L(\lambda)) & = \frac{d}{d\lambda}\left(n\ln(\lambda) -\lambda \sum_{i=1}^nx_i\right) \\[2ex] & = \frac{n}{\lambda} - \sum_{i=1}^nx_i \\[2ex] \end{alignedat}

Finally, we set the derivative equal to 00 and solve for λ\lambda:

0nλi=1nxinλ=i=1nxiλ=ni=1nxiλ=1Xˉ\begin{alignedat}{1} 0 \equiv \frac{n}{\lambda} & - \sum_{i=1}^nx_i \\[2ex] \frac{n}{\lambda} & = \sum_{i=1}^nx_i \\[3ex] \lambda & = \frac{n}{\sum_{i=1}^nx_i} \\[2ex] \lambda & = \frac{1}{\bar{X}} \\[2ex] \end{alignedat}

Thus, the maximum likelihood estimator for \lambda is 1 / \bar{X}, which makes a lot of sense. The mean of the exponential distribution is 1 / \lambda, and we usually estimate that mean by \bar{X}. Since \bar{X} is a good estimator for 1 / \lambda, it stands to reason that a good estimator for \lambda is 1 / \bar{X}.

Conventionally, we put a "hat" over the λ\lambda that maximizes the likelihood function to indicate that it is the MLE. Such notation looks like this: λ^\widehat{\lambda}.

Note that we went from "little x's", xix_i, to "big x", Xˉ\bar{X}, in the equation. We do this to indicate that λ^\widehat{\lambda} is a random variable.

Just to be careful, we probably should have performed a second-derivative test on the function, ln(L(λ))\ln(L(\lambda)), to ensure that we found a maximum likelihood estimator and not a minimum likelihood estimator.
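
As a sanity check, we can maximize the log-likelihood numerically and confirm that it lands on 1 / \bar{X}; the Exp(\lambda = 2) data and the use of SciPy's bounded scalar minimizer are assumptions made only for this sketch:

```python
# Numerically maximize ln L(lambda) = n*ln(lambda) - lambda*sum(x_i) and
# compare the maximizer to the closed-form MLE 1/X-bar.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(seed=5)
x = rng.exponential(scale=0.5, size=200)            # assumed true lambda = 2

def neg_log_lik(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, 1.0 / x.mean())                        # the two should agree closely
```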

Bernoulli Example

Let's look at a discrete example. Suppose we have X1,...,XniidBern(p)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Bern}(p). Let's find the MLE for pp. We might remember that the expected value of Bern(pp) random variable is pp, so we shouldn't be surprised if Xˉ\bar X is our MLE.

Let's remember the values that XiX_i can take:

Xi={1w.p. p0w.p. 1pX_i = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1-p \end{cases}

Therefore, we can write the pmf for a Bern(pp) random variable as follows:

f(x)=px(1p)1x,x=0,1f(x) = p^x(1-p)^{1-x}, \quad x = 0, 1

Now, let's calculate L(p)L(p). First:

L(p)=i=1nf(xi)=pxi(1p)1xiL(p) = \prod_{i=1}^n f(x_i) = \prod p^{x_i}(1-p)^{1-x_i}

Remember that axay=ax+ya^x * a^y = a^{x+y}. So:

pxi(1p)1xi=pi=1nxi(1p)ni=1nxi\prod p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}

Let's take the natural log of both sides, remembering that ln(ab)=ln(a)+ln(b)\ln(ab) = \ln(a) + \ln(b):

ln(L(p))=i=1nxiln(p)+(ni=1nxi)ln(1p)\ln(L(p)) = \sum_{i=1}^n x_i\ln(p) + (n-\sum_{i=1}^n x_i)\ln(1-p)

Let's take the derivative, remembering that the derivative of ln(1p)\ln(1-p) equals 1/(1p)-1/(1-p):

ddpln(L(p))=i=1nxipni=1nxi1p\frac{d}{dp}\ln(L(p)) = \frac{\sum_{i=1}^n x_i}{p} - \frac{n-\sum_{i=1}^n x_i}{1-p}

Now we can set the derivative equal to zero, and solve for p^\widehat p:

\begin{alignedat}{1} \frac{\sum_{i=1}^n x_i}{\widehat p} &- \frac{n-\sum_{i=1}^n x_i}{1-\widehat p} = 0 \\ \frac{\sum_{i=1}^n x_i}{\widehat p} &= \frac{n-\sum_{i=1}^n x_i}{1-\widehat p} \\ \frac{\sum_{i=1}^n x_i}{n-\sum_{i=1}^n x_i} &= \frac{\widehat p}{1-\widehat p} \\[3ex] \frac{\bar X}{1 - \bar X} &= \frac{\widehat p}{1-\widehat p} \\[2ex] \bar X &= \widehat p \end{alignedat}

MLE Examples

In this lesson, we'll look at some additional MLE examples. MLEs will become very important when we eventually carry out goodness-of-fit tests.

Normal Example

Suppose we have X1,...,XniidNor(μ,σ2)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Nor}(\mu, \sigma^2). Let's find the simultaneous MLE's for μ\mu and σ2\sigma^2:

L(μ,σ2)=i=1nf(xi)=i=1n12πσ2exp{12(xiμ)2σ2}=1(2πσ2)n/2exp{12i=1n(xiμ)2σ2}\begin{alignedat}{1} L(\mu, \sigma^2) &= \prod_{i=1}^n f(x_i) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2}\frac{(x_i - \mu)^2}{\sigma^2}\right\} \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^n\frac{(x_i - \mu)^2}{\sigma^2}\right\} \end{alignedat}

Let's take the natural log:

ln(L(μ,σ2))=n2ln(2π)n2ln(σ2)12σ2i=1n(xiμ)2\ln(L(\mu, \sigma^2)) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2

Let's take the first derivative with respect to μ\mu, to find the MLE, μ^\widehat\mu, for μ\mu. Remember that the derivative of terms that don't contain μ\mu are zero:

μln(L(μ,σ2))=μ12σ2i=1n(xiμ)2\frac{\partial}{\partial\mu} \ln(L(\mu, \sigma^2)) = \frac{\partial}{\partial\mu} \frac{-1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2

What's the derivative of (xiμ)2(x_i - \mu)^2 with respect to μ\mu? Naturally, it's 2(xiμ)-2(x_i - \mu). Therefore:

μln(L(μ,σ2))=1σ2i=1n(xiμ)\frac{\partial}{\partial\mu} \ln(L(\mu, \sigma^2)) = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \mu)

We can set the expression on the right equal to zero and solve for μ^\widehat\mu:

0=1σ2i=1n(xiμ^)0 = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \widehat\mu)

If we solve for μ^\widehat\mu, we see that μ^=Xˉ\widehat\mu = \bar X, which we expect. In other words, the MLE for the true mean, μ\mu, is the sample mean, Xˉ\bar X.

Now, let's take the derivative of ln(L(μ,σ2))\ln(L(\mu,\sigma^2)) with respect to σ2\sigma^2. Consider:

σ2ln(L(μ,σ2))=n2σ2+12σ4i=1n(xiμ^)2\frac{\partial}{\partial\sigma^2} \ln(L(\mu, \sigma^2)) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \widehat\mu)^2

We can set the expression on the right equal to zero and solve for σ2^\widehat{\sigma^2}:

n2σ2^+12σ4^i=1n(xiμ^)2=012σ4^i=1n(xiμ^)2=n2σ2^1ni=1n(xiμ^)2=2σ4^2σ2^i=1n(XiXˉ)2n=σ2^\begin{alignedat}{1} -\frac{n}{2\widehat{\sigma^2}} + \frac{1}{2\widehat{\sigma^4}}\sum_{i=1}^n (x_i - \widehat\mu)^2 &= 0 \\ \frac{1}{2\widehat{\sigma^4}}\sum_{i=1}^n (x_i - \widehat\mu)^2 &= \frac{n}{2\widehat{\sigma^2}} \\ \frac{1}{n}\sum_{i=1}^n (x_i - \widehat\mu)^2 &= \frac{2\widehat{\sigma^4}}{2\widehat{\sigma^2}} \\ \frac{\sum_{i=1}^n(X_i - \bar X)^2}{n} &= \widehat{\sigma^2} \end{alignedat}

Notice how close σ2^\widehat{\sigma^2} is to the unbiased sample variance:

S2=i=1n(XiXˉ)2n1=nσ2^n1S^2 = \frac{\sum_{i=1}^n(X_i - \bar X)^2}{n - 1} = \frac{n\widehat{\sigma^2}}{n-1}

Because S^2 is unbiased, we have to expect that \widehat{\sigma^2} is slightly biased. However, \widehat{\sigma^2} is the estimator that actually maximizes the likelihood, so it is the MLE, and it happens to have slightly less variance than S^2. Regardless, the two quantities converge as n grows.
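
A short sketch of the two variance estimators side by side (the Nor(30, 8^2) data and n = 25 are assumed for illustration):

```python
# Contrast the MLE of sigma^2 (divide by n) with the unbiased S^2 (divide by n-1).
import numpy as np

rng = np.random.default_rng(seed=6)
x = rng.normal(loc=30.0, scale=8.0, size=25)

mu_hat = x.mean()              # MLE of mu is the sample mean
sigma2_hat = x.var(ddof=0)     # MLE of sigma^2: divide by n
s2 = x.var(ddof=1)             # unbiased sample variance: divide by n-1

print(mu_hat, sigma2_hat, s2)  # note sigma2_hat = (n-1)/n * s2
```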

Gamma Example

Let's look at the Gamma distribution, parameterized by r and \lambda. The pdf for this distribution is shown below. Recall that \Gamma(r) is the gamma function.

f(x)=λrΓ(r)xr1eλx,x>0f(x) = \frac{\lambda^r}{\Gamma(r)}x^{r-1}e^{-\lambda x}, \quad x > 0

Suppose we have X1,...,XniidGam(r,λ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Gam}(r, \lambda). Let's find the MLE's for rr and λ\lambda:

L(r,λ)=i=1nf(xi)=λnr[Γ(r)]n(i=1nxi)r1eλixi\begin{alignedat}{1} L(r, \lambda) &= \prod_{i=1}^n f(x_i) \\[3ex] &= \frac{\lambda^{nr}}{[\Gamma(r)]^n} \left(\prod_{i=1}^n x_i\right)^{r-1}e^{-\lambda\sum_i x_i} \end{alignedat}

Let's take the natural logarithm of both sides, remembering that ln(a/b)=ln(a)ln(b)\ln(a/b) = \ln(a) - \ln(b):

ln(L)=rnln(λ)nln(Γ(r))+(r1)ln(ixi)λixi\ln(L) = rn\ln(\lambda) - n\ln(\Gamma(r)) + (r-1)\ln\left(\prod_ix_i\right) - \lambda\sum_ix_i

Let's get the MLE of λ\lambda first by taking the derivative with respect to λ\lambda. Notice that the middle two terms disappear:

λln(L)=rnλi=1nxi\frac{\partial}{\partial\lambda}\ln(L) = \frac{rn}{\lambda} - \sum_{i=1}^n x_i

Let's set the expression on the right equal to zero and solve for λ^\widehat\lambda:

0=r^nλ^i=1nxii=1nxi=r^nλ^λ^=r^ni=1nxi=r^/Xˉ\begin{alignedat}{1} 0 &= \frac{\widehat rn}{\widehat\lambda} - \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i &= \frac{\widehat rn}{\widehat\lambda} \\ \widehat\lambda &= \frac{\widehat rn}{\sum_{i=1}^n x_i} = \widehat r / \bar X\\ \end{alignedat}

It turns out that the mean of the Gamma distribution is r / \lambda. If we set \bar X \approx \mu = r / \lambda, then a simple rearrangement gives \lambda \approx r / \bar X, so the MLE \widehat\lambda = \widehat r / \bar X makes sense.

Now let's find the MLE of rr. First, we take the derivative with respect to rr:

rln(L)=nln(λ)nΓ(r)ddrΓ(r)+ln(ixi)\frac{\partial}{\partial r}\ln(L) = n\ln(\lambda) - \frac{n}{\Gamma(r)}\frac{d}{dr}\Gamma(r) + \ln\left(\prod_ix_i\right)

We can define the digamma function, \Psi(r), to help us with the term involving the gamma function and its derivative:

Ψ(r)Γ(r)/Γ(r)\Psi(r) \equiv \Gamma'(r) / \Gamma(r)

At this point, we can substitute in λ^=r^/Xˉ\widehat\lambda = \widehat r/\bar X, and then use a computer to solve the following equation, either by bisection, Newton's method, or some other method:

n\ln(\widehat r/\bar X) - n\Psi(\widehat r) + \ln\left(\prod_ix_i\right) = 0

The challenging part of evaluating the digamma function is computing the derivative of the gamma function. We can use the definition of the derivative here to help us, choosing our favorite small hh and then evaluating:

Γ(r)Γ(r+h)Γ(r)h\Gamma'(r) \approx \frac{\Gamma(r+h) - \Gamma(r)}{h}
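
Here is a sketch of how the numerical solution might look. Instead of hand-rolling bisection or a finite-difference digamma, it uses SciPy's digamma function and a bracketing root finder (those substitutions, the Gam(r = 3, \lambda = 2) data, and the root bracket are all assumptions of this sketch):

```python
# Solve n*ln(r/X-bar) - n*psi(r) + sum(ln x_i) = 0 for r-hat, then set
# lambda-hat = r-hat / X-bar.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(seed=7)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=500)   # assumed r = 3, lambda = 2

n, x_bar, sum_log = len(x), x.mean(), np.log(x).sum()

def g(r):
    return n * np.log(r / x_bar) - n * digamma(r) + sum_log

r_hat = brentq(g, 1e-3, 100.0)    # assumed bracket wide enough to contain the root
lam_hat = r_hat / x_bar
print(r_hat, lam_hat)             # should land near 3 and 2
```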

Uniform Example

Suppose we have X1,...,XniidUnif(0,θ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Unif}(0, \theta). Let's find the MLE for θ\theta.

Remember that the pdf f(x)=1/θ,0<x<θf(x) = 1/\theta, 0 < x < \theta. We can take the likelihood function as the product of the f(xi)f(x_i)'s:

L(θ)=i=1nf(xi)={1/θn,if 0xiθ,i0otherwiseL(\theta) = \prod_{i=1}^n f(x_i) = \begin{cases} 1/\theta^n,& \text{if } 0 \leq x_i \leq \theta, \forall i \\ 0 & \text{otherwise} \end{cases}

In order to have L(θ)>0L(\theta) > 0, we must have 0xiθ,i0 \leq x_i \leq \theta, \forall i. In other words, θ\theta must be at least as large as the largest observation we've seen yet: θmaxixi\theta \geq \max_i x_i.

Subject to this constraint, L(\theta) = 1 / \theta^n is decreasing in \theta, so it is maximized at the smallest possible \theta value, namely \widehat\theta = \max_i X_i.

This result makes sense in light of the similar (unbiased) estimator Y2=(n+1)maxiXi/nY_2 = (n+1)\max_i X_i/n that we saw previously.

Invariance Properties of MLEs

In this lesson, we will expand the vocabulary of maximum likelihood estimators by looking at the invariance property of MLEs. In a nutshell, if we have the MLE for some parameter, then we can use the invariance property to determine the MLE for any reasonable function of that parameter.

Invariance Property of MLE's

If θ^\widehat{\theta} is the MLE of some parameter, θ\theta, and h()h(\cdot) is a 1:1 function, then h(θ^)h(\widehat{\theta}) is the MLE of h(θ)h(\theta).

Remember that this invariance property does not hold for unbiasedness. For instance, we said previously that the sample variance is an unbiased estimator for the true variance because E[S2]=σ2E[S^2] = \sigma^2. However, E[S2]σE[\sqrt{S^2}] \neq \sigma, so we cannot use the sample standard deviation as an unbiased estimator for the true standard deviation.

Examples

Suppose we have a random sample, X1,...,XniidBern(p)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Bern}(p). We might remember that the MLE of pp is p^=Xˉ\widehat p = \bar X. If we consider the 1:1 function h(θ)=θ2,θ>0h(\theta) = \theta^2, \theta > 0, then the invariance property says that the MLE of p2p^2 is Xˉ2\bar X^2.

Suppose we have a random sample, X1,...,XniidNor(μ,σ2)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Nor}(\mu, \sigma^2). We saw previously that the MLE for σ2\sigma^2 is:

σ2^=1ni=1n(XiXˉ)2\widehat{\sigma^2} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2

We just said that we couldn't take the square root of S2S^2 to estimate σ \sigma in an unbiased way. However, we can use the square root of σ2^\widehat{\sigma^2} to get the MLE for σ\sigma.

If we consider the 1:1 function h(θ)=+θh(\theta) = +\sqrt\theta, then the invariance property says that the MLE of σ\sigma is:

σ^=σ2^=i=1n(XiXˉ)2n\widehat\sigma = \sqrt{\widehat{\sigma^2}} = \sqrt\frac{\sum_{i=1}^n (X_i - \bar X)^2}{n}

Suppose we have a random sample, X1,...,XniidExp(λ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda). The survival function, Fˉ(x)\bar{F}(x), is:

Fˉ(x)=P(X>x)=1F(x)=1(1eλx)=eλx\bar{F}(x) = P(X > x) = 1 - F(x) = 1 - (1 - e^{-\lambda x}) = e^{-\lambda x}

We saw previously that the MLE for \lambda is \widehat\lambda = 1/\bar X. Therefore, using the invariance property, we can see that the MLE for \bar{F}(x) is \bar{F}(x) with \widehat{\lambda} plugged in for \lambda:

Fˉ(x)^=eλ^x=ex/Xˉ\widehat{\bar{F}(x)} = e^{-\widehat{\lambda}x} = e^{-x / \bar{X}}

The MLE for the survival function is used all the time in actuarial sciences to determine - somewhat gruesomely, perhaps - the probability that people will live past a certain age.

The Method of Moments (Optional)

In this lesson, we'll finish off our discussion on estimators by talking about the Method of Moments.

The Method Of Moments

The kkth moment of a random variable XX is:

E[Xk]={xxkf(x)if X is discreteRxkf(x)dxif X is continuousE[X^k] = \begin{cases} \sum_x x^k f(x) & \text{if X is discrete} \\ \int_{\mathbb{R}} x^k f(x)dx & \text{if X is continuous} \end{cases}

Suppose we have a sequence of random variables, X1,...,XnX_1,...,X_n, which are iid from pmf/pdf f(x)f(x). The method of moments (MOM) estimator for E[Xk]E[X^k], mkm_k, is:

mk=1ni=1nXikm_k = \frac{1}{n} \sum_{i=1}^n X_i^k

Note that mkm_k is equal to the sample average of the XikX_i^k's. Indeed, the MOM estimator for μ=E[Xi]\mu = E[X_i], is the sample mean, Xˉ\bar X:

m1=1ni=1nXi=Xˉ=E[Xi]m_1 = \frac{1}{n} \sum_{i=1}^n X_i = \bar X = E[X_i]

Similarly, we can find the MOM estimator for k=2k=2:

m2=1ni=1nXi2=E[Xi2]m_2 = \frac{1}{n}\sum_{i=1}^n X_i^2 = E[X_i^2]

We can combine the MOM estimators for k=1,2k=1,2 to produce an expression for the variance of XiX_i:

Var(Xi)=E[Xi2](E[Xi])2=1ni=1nXi2Xˉ2=i=1n(Xi2n)Xˉ2=i=1n(Xi2nXˉ2n)=1ni=1n(Xi2Xˉ2)=n1nS2\begin{alignedat}{1} \text{Var}(X_i) &= E[X_i^2] - (E[X_i])^2 \\ &= \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2 \\ &= \sum_{i=1}^n\left(\frac{X_i^2}{n}\right) - \bar X^2 \\ &= \sum_{i=1}^n\left(\frac{X_i^2}{n} - \frac{\bar X^2}{n} \right) \\ &= \frac{1}{n}\sum_{i=1}^n\left(X_i^2 - \bar X^2 \right) \\ &= \frac{n-1}{n}S^2 \end{alignedat}

Of course, it's perfectly okay to use S2S^2 to estimate the variance, and the two quantities converge as nn grows.

Poisson Example

Suppose that X1,...,XniidPois(λ)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Pois}(\lambda). We know that, for the Poisson distribution, E[Xi]=λE[X_i] = \lambda, so a MOM estimator for λ\lambda is Xˉ\bar X.

We might remember that the variance of the Poisson distribution is also λ\lambda, so another MOM estimator for λ\lambda is:

n1nS2\frac{n-1}{n}S^2

As we can see, we have two different estimators for λ\lambda, both of which are MOM estimators. In practice, we usually will use the easier-looking estimator if we have a choice.

Normal Example

Suppose that X1,...,XniidNor(μ,σ2)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Nor}(\mu, \sigma^2). We know that the MOM estimators for μ\mu and σ2\sigma^2 are Xˉ\bar X and (n1)S2/n(n-1)S^2 / n, respectively. For this example, these estimators happen to be the same as the MLEs.

Beta Example

Now let's look at a less trivial example. Here we might really rely on MOM estimators because we cannot find the MLEs so easily.

Suppose that X1,...,XniidBeta(a,b)X_1,...,X_n \overset{\text{iid}}{\sim} \text{Beta}(a, b). The beta distribution has the following pdf:

f(x)=Γ(a+b)Γ(a)Γ(b)xa1(1x)b1,0<x<1f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}, \quad 0 < x < 1

After much algebra, it turns out that:

E[X]=aa+bVar(X)=ab(a+b)2(a+b+1)=E[X]b(a+b)(a+b+1)\begin{alignedat}{1} E[X] &= \frac{a}{a+b} \\[2ex] \text{Var}(X) &= \frac{ab}{(a+b)^2(a+b+1)} = \frac{E[X]b}{(a+b)(a+b+1)} \end{alignedat}

Let's find MOM estimators for aa and bb. Given the expected value above, let's solve for aa:

E[X]=aa+b(a+b)E[X]=aaE[X]+bE[X]=abE[X]=aaE[X]bE[X]=a(1E[X])bE[X]1E[X]=a\begin{alignedat}{1} E[X] &= \frac{a}{a+b} \\ (a+b)E[X] &= a \\ aE[X] + bE[X] &= a \\ bE[X] &= a - aE[X] \\ bE[X] &= a(1 - E[X]) \\ \frac{bE[X]}{1 - E[X]} &= a \end{alignedat}

Since we know that Xˉ\bar X is the MOM estimator for E[X]E[X], we have a reasonable approximation for aa:

abXˉ1Xˉa \approx \frac{b\bar X}{1-\bar X}

We can solve for bb by making the following substitutions into the variance equation above: Xˉ\bar X for E[X]E[X], S2S^2 for Var(X)\text{Var}(X), and the approximation of aa, in terms of Xˉ\bar X and bb, for aa. After a bunch of algebra, we have the MOM estimator for bb:

b(1Xˉ)2XˉS21+Xˉb \approx \frac{(1 - \bar X)^2\bar X}{S^2} - 1 + \bar X

Let's plug bb back into our approximation for aa to get aa in terms of S2S^2 and Xˉ\bar X:

\begin{alignedat}{1} a &\approx \frac{\left(\frac{(1 - \bar X)^2\bar X}{S^2} - 1 + \bar X\right)\bar X}{1-\bar X} \\[3ex] &\approx \frac{\frac{(1 - \bar X)^2\bar X^2}{S^2} + \bar X^2 - \bar X }{1-\bar X} \\[3ex] &\approx \frac{(1 - \bar X)^2\bar X^2}{S^2(1 - \bar X)} + \frac{\bar X^2}{1 - \bar X} - \frac{\bar X}{1 - \bar X} \\[3ex] &\approx \frac{(1 - \bar X)\bar X^2}{S^2} + \frac{\bar X^2 - \bar X}{1 - \bar X} \\[3ex] &\approx \bar X \left[\frac{(1 - \bar X)\bar X}{S^2} + \frac{\bar X - 1}{1 - \bar X}\right] \\ &\approx \bar X \left[\frac{(1 - \bar X)\bar X}{S^2} - 1\right] \end{alignedat}

Given numerical values for \bar X and S^2, we can plug into these two formulas to obtain the MOM estimates for a and b.
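
Here's a minimal sketch of these MOM formulas in action; the Beta(a = 2, b = 5) sample is an assumption used only to check that the estimates land near the truth:

```python
# Method-of-moments estimates for Beta(a, b) from X-bar and S^2.
import numpy as np

rng = np.random.default_rng(seed=8)
x = rng.beta(a=2.0, b=5.0, size=2_000)

x_bar, s2 = x.mean(), x.var(ddof=1)

a_mom = x_bar * ((1.0 - x_bar) * x_bar / s2 - 1.0)
b_mom = (1.0 - x_bar) ** 2 * x_bar / s2 - 1.0 + x_bar

print(a_mom, b_mom)   # should be near the true values 2 and 5
```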

Goodness-of-Fit Tests

In this lesson, we'll start our discussion on goodness-of-fit tests. We use these tests to assess whether a particular simulation input distribution accurately reflects reality.

What We've Done So Far

Until now, we've guessed at reasonable distributions and estimated the relevant parameters based on the data. We've used different estimators, such as MLEs and MOM estimators. Now we will conduct a formal test to validate the work we've done. If our guesses and estimations are close, our tests should reflect that.

In particular, we will conduct a formal hypothesis test to determine whether a series of observations, X1,...,XnX_1,...,X_n, come from a particular distribution with pmf/pdf, f(x)f(x). Here's our null hypothesis:

H_0: X_1,...,X_n \overset{\text{iid}}{\sim} \text{ pmf/pdf } f(x)

We will perform this hypothesis test at a level of significance, α\alpha, where:

αP(Reject H0H0 true)=P(Type I error)\alpha \equiv P(\text{Reject } H_0 | H_0 \text{ true}) = P(\text{Type I error})

As usual, we assume that H0H_0 is true, only rejecting it if we get ample evidence to the contrary. The distribution is innocent until proven guilty. Usually, we choose α\alpha to be 0.050.05 or 0.010.01.

High-Level View of Goodness-Of-Fit Test Procedure

Let's first divide the domain of f(x)f(x) into kk sets, A1,A2,...,AkA_1, A_2,...,A_k. If XX is discrete, then each set will consist of distinct points. If XX is continuous, then each set will contain an interval.

Second, we will tally the number of observations that fall into each set. We refer to this tally as Oi,i=1,2,...,kO_i, i = 1, 2, ..., k. For example, O1O_1 refers to the number of observations we see in A1A_1. Remember that iOi=n\sum_i O_i = n, where nn is the total number of observations we collect.

If p_i \equiv P(X \in A_i), then O_i \sim \text{Bin}(n, p_i). In other words, O_i counts the number of successes - landing in set A_i - given n trials, where the probability of success is p_i. Because O_i is binomial, the expected number of observations that fall in each set, assuming H_0 is true, is E_i = E[O_i] = np_i, i = 1,2,...,k.

Next, we calculate a test statistic based on the differences between the EiE_i's and OiO_i's. The chi-squared g-o-f test statistic is:

χ02i=1k(OiEi)2Ei\chi^2_0 \equiv \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}

If the distribution we've guessed fits the data well, then the EiE_i's and OiO_i's will be very close, and χ02\chi^2_0 will be small. On the other hand, if we've made a bad guess, χ02\chi^2_0 will be large.

As we said, a large value of χ02\chi^2_0 indicates a bad fit. In particular, we reject H0H_0 if:

\chi^2_0 > \chi^2_{\alpha, k-1-s}

Remember that kk refers to the number of sets we have generated from the domain of f(x)f(x), and α\alpha refers to the level of significance at which we wish to conduct our test.

Here, ss refers to the number of unknown parameters from f(x)f(x) that have to be estimated. For example, if XExp(λ)X \sim \text{Exp}(\lambda), then s=1s= 1. If XNor(μ,σ2)X \sim \text{Nor}(\mu, \sigma^2), then s=2s = 2.

Additionally, χα,v2\chi^2_{\alpha, v} refers to the (1α)(1 - \alpha) quantile of the χv2\chi^2_v distribution. Specifically:

P(\chi^2_v < \chi^2_{\alpha, v}) = 1 - \alpha

If \chi^2_0 \leq \chi^2_{\alpha, k-1-s}, we fail to reject H_0.

Remarks

In order to ensure that the test gives good results, we want to select kk such that Ei5E_i \geq 5 and pick n30n \geq 30.

If the degrees of freedom, v = k - 1 - s, is large, then we can approximate \chi^2_{\alpha,v} using the corresponding standard normal quantile, z_\alpha:

χα,v2v[129v+zα29v]3\chi^2_{\alpha, v} \approx v\left[1 - \frac{2}{9v} + z_\alpha \sqrt\frac{2}{9v}\right]^3

If we don't want to use the χ2\chi^2 goodness-of-fit test, we can use a different test, such as Kolmogorov-Smirnov, Anderson-Darling, or Shapiro-Wilk, among others.

Uniform Example

Let's test our null hypothesis that a series of observations, X1,...XnX_1,...X_n, are iid Unif(0,1). Suppose we collect n=1000n=1000 observations, where 0Xi10 \leq X_i \leq 1, and we divide the unit interval into k=5k=5 sets. Consider the following OiO_i's and EiE_i's below:

interval[0,0.2](0.2,0.4](0.4,0.6](0.6,0.8](0.8,1.0]Ei200200200200200Oi179208222199192\begin{array}{c|ccccc} \text{interval} & [0,0.2] & (0.2, 0.4] & (0.4, 0.6] & (0.6, 0.8] & (0.8, 1.0] \\ \hline E_i & 200 & 200 & 200 & 200 & 200 \\ O_i & 179 & 208 & 222 & 199 & 192 \\ \end{array}

The OiO_i's refer to the actual number of observations that landed in each interval. Remember that, since OiBin(n,pi)O_i \sim \text{Bin}(n, p_i), E[Oi]=npi=10000.2=200E[O_i] = np_i = 1000 * 0.2 = 200.

Let's calculate our goodness-of-fit statistic:

χ02=i=1k(OiEi)2Ei=1200((179200)2+...+(192200)2)=5.27\begin{alignedat}{1} \chi^2_0 &= \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} \\ &= \frac{1}{200} ((179 - 200)^2 + ... + (192 - 200)^2) \\ &= 5.27 \end{alignedat}

Let's set our significance level to α=0.05\alpha = 0.05. Since there are no unknown parameters, s=0s = 0, so k1s=4k - 1 - s = 4. Therefore:

χα,k1s2=χ0.05,42=9.49 (tabled) \chi^2_{\alpha, k - 1 - s} = \chi^2_{0.05, 4} = 9.49 \text{ (tabled) }

Since χ02χ0.05,42\chi^2_0 \leq \chi^2_{0.05, 4}, we fail to reject the null hypothesis and begrudgingly accept that the XiX_i's are iid Unif(0,1).
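
This calculation is easy to reproduce; here is a sketch using the observed counts from the table above, with SciPy supplying the \chi^2 quantile:

```python
# Chi-squared goodness-of-fit statistic for the Unif(0,1) example above.
import numpy as np
from scipy.stats import chi2

observed = np.array([179, 208, 222, 199, 192])
expected = np.full(5, 200.0)                    # E_i = n * p_i = 1000 * 0.2

chi2_0 = ((observed - expected) ** 2 / expected).sum()
critical = chi2.ppf(1 - 0.05, df=5 - 1 - 0)     # k - 1 - s with s = 0

print(chi2_0, critical)   # 5.27 <= 9.49, so we fail to reject H0
```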

Discrete Example

Let's hypothesize that the number of defects in a printed circuit board follows a Geometric(pp) distribution. Let's look at a random sample of n=70n = 70 boards and observe the number of defects. Consider the following table:

# defectsfrequency13421832495770\begin{array}{cc} \text{\# defects} & \text{frequency} \\ \hline 1 & 34 \\ 2 & 18 \\ 3 & 2 \\ 4 & 9 \\ 5 & 7 \\ \hline & 70 \end{array}

Now let's test the null hypothesis that X1,...,X70Geom(p)X_1,...,X_{70} \sim \text{Geom}(p). We can start by estimating pp via the MLE. The likelihood function is:

L(p)=i=1nf(xi)=i=1n(1p)xi1p=(1p)i=1nxinpn\begin{alignedat}{1} L(p) &= \prod_{i = 1}^nf(x_i) \\ &= \prod_{i = 1}^n(1-p)^{x_i - 1}p \\ &= (1-p)^{\sum_{i=1}^n x_i - n}p^n \end{alignedat}

Let's take the natural log of both sides:

ln(L(p))=ln((1p)i=1nxinpn)=(i=1nxin)ln(1p)+nln(p)\begin{alignedat}{1} \ln(L(p)) &= \ln((1-p)^{\sum_{i=1}^n x_i - n}p^n) \\ &= \left(\sum_{i=1}^n x_i - n\right)\ln(1-p) + n\ln(p) \\ \end{alignedat}

Now, let's take the derivative:

\frac{d\ln(L(p))}{dp} = \frac{-\left(\sum_{i=1}^n x_i - n\right)}{1-p} + \frac{n}{p}

Now, let's set the expression on the right to zero and solve for p^\widehat p:

\begin{alignedat}{1} 0 = \frac{-\left(\sum_{i=1}^n x_i - n\right)}{1-\widehat p} & + \frac{n}{\widehat p} \\[3ex] \frac{\sum_{i=1}^n x_i - n}{1-\widehat p} & = \frac{n}{\widehat p} \\[3ex] \frac{\sum_{i=1}^n x_i - n}{n} & = \frac{1-\widehat p}{\widehat p} \\[3ex] \bar X - 1 &= \frac{1}{\widehat p} - 1 \\[3ex] \widehat{p} &= \frac{1}{\bar X} \end{alignedat}

We know that, for XGeom(p)X \sim \text{Geom}(p), E[X]=1/pE[X] = 1 / p. Therefore, it makes sense that our estimator, p^\widehat p, is equal to 1/Xˉ1 / \bar X. Anyway, let's compute p^\widehat p:

p^=1Xˉ=701(34)+2(18)+3(2)+4(9)+5(7)=0.476\widehat p = \frac{1}{\bar X} = \frac{70}{1(34) + 2(18) + 3(2) + 4(9) + 5(7)} = 0.476

Given p^=0.476\widehat p = 0.476, let's turn to the goodness-of-fit test statistic. We have our OiO_i's, and now we can compute our EiE_i's. By the invariance property of MLEs, the MLE for the expected number of boards, E^x\widehat E_x, having a particular number of defects, xx, is equal to nP(X=x)=n(1p^)x1p^nP(X=x) = n(1-\widehat p)^{x-1}\widehat p.

Consider the following table. Of course, XX can take values from 11 to \infty, so we'll condense P(X[5,))P(X \in [5, \infty)) into the last row of the table.

xP(X=x)E^xOx10.476233.333420.249417.461830.13079.15240.06844.79950.07525.2771.00007070\begin{array}{cccc} x & P(X = x) & \widehat E_x & O_x \\ \hline 1 & 0.4762 & 33.33 & 34 \\ 2 & 0.2494 & 17.46 & 18 \\ 3 & 0.1307 & 9.15 & 2 \\ 4 & 0.0684 & 4.79 & 9 \\ \geq 5 & 0.0752 & 5.27 & 7 \\ \hline & 1.0000 & 70 & 70 \end{array}

Remember that we said we'd like to ensure that E_i \geq 5 in order for the goodness-of-fit test to work correctly. Unfortunately, in this case, E_4 < 5. No problem. Let's just combine the X = 4 and X \geq 5 cells into a single X \geq 4 cell:

xP(X=x)ExOx10.476233.333420.249417.461830.13079.15240.143610.06161.00007070\begin{array}{cccc} x & P(X = x) & E_x & O_x \\ \hline 1 & 0.4762 & 33.33 & 34 \\ 2 & 0.2494 & 17.46 & 18 \\ 3 & 0.1307 & 9.15 & 2 \\ \geq 4 & 0.1436 & 10.06 & 16 \\ \hline & 1.0000 & 70 & 70 \end{array}

Let's compute the test statistic:

χ02=x=14(ExOx)2Ex=(33.3334)233.33+...=9.12\chi^2_0 = \sum_{x=1}^4 \frac{(E_x - O_x)^2}{E_x} = \frac{(33.33 - 34)^2}{33.33} + ... = 9.12

Now, let's compare our test statistic to the appropriate χ2\chi^2 quantile. We know that k=4k = 4, since we partitioned the values that XX can take into four sets. We also know that s=1s = 1, since we had to estimate one parameter. Given α=0.05\alpha = 0.05:

χα,k1s2=χ0.05,22=5.99\chi^2_{\alpha, k-1-s} = \chi^2_{0.05, 2} = 5.99

Since χ02=9.12>χ0.05,22=5.99\chi^2_0 = 9.12 > \chi^2_{0.05, 2} = 5.99, we reject H0H_0 and conclude that the number of defects in circuit boards probably isn't geometric.

Exponential Example

In this lesson, we'll apply a goodness-of-fit test for the exponential distribution. It turns out that we can apply the general recipe we'll walk through here to other distributions as well.

Continuous Distributions

For continuous distributions, let's denote the intervals Ai(ai1,ai],i=1,2,...,kA_i \equiv (a_{i-1}, a_i], i = 1,2,...,k. For convenience, we want to choose the aia_i's such that XX has an equal probability of landing in any interval, AiA_i. In other words:

pi=P(XAi)=P(ai1<Xai)=1/k,for all ip_i = P(X \in A_i) = P(a_{i-1} < X \leq a_i) = 1/k, \quad \text{for all } i

Example

Suppose that we're interested in fitting a distribution to a series of interarrival times. Let's assume that the observations are exponential:

H0:X1,X2,...,XniidExp(λ)H_0 : X_1, X_2,...,X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda)

We want to perform a χ2\chi^2 goodness-of-fit test with equal-probability intervals, which means we must choose aia_i's such that:

F(ai)=P(Xai)=1eλai=ik,i=1,2,...,kF(a_i) = P(X \leq a_i) = 1 - e^{-\lambda a_i} = \frac{i}{k}, \quad i = 1,2,...,k

If the intervals are equal probability, then the probability that an observation falls in any of the k intervals must equal 1/k. Correspondingly, F(x) must increase by 1/k as it sweeps through each interval, until F(a_k) = 1.

In any event, let's solve for aia_i:

ik=1eλaieλai=1ikλai=ln(1ik)ai=1λln(1ik)\begin{alignedat}{1} \frac{i}{k} &= 1 - e^{-\lambda a_i} \\ e^{-\lambda a_i} &= 1 - \frac{i}{k} \\ -\lambda a_i &= \ln\left(1 - \frac{i}{k}\right) \\ a_i &= \frac{-1}{\lambda}\ln\left(1 - \frac{i}{k}\right) \\ \end{alignedat}

Unfortunately, λ\lambda is unknown, so we cannot calculate the aia_i's. We have to estimate λ\lambda. Thankfully, we might remember that the MLE is λ^=1/Xˉ\widehat\lambda = 1 / \bar X. Thus, by the invariance property, the MLEs of the aia_i's are:

a^i=1λ^ln(1ik)=Xˉln(1ik),i=1,2,...,k\begin{alignedat}{1} \widehat a_i &= \frac{-1}{\widehat\lambda}\ln\left(1 - \frac{i}{k}\right) \\ &= -\bar X \ln\left(1 - \frac{i}{k}\right), \quad i = 1,2,...,k \end{alignedat}

Suppose that we take n = 100 observations and divide them into k=5 equal-probability intervals. Let's also suppose that the sample mean based on these observations is \bar X = 9.0. Then:

a^i=9.0ln(10.2i),i=1,...,5\widehat a_i = -9.0\ln\left(1 - 0.2i\right), \quad i = 1,...,5

Given this formula, let's compute our first equal-probability interval:

(a^0,a^1]=(9.0ln(10.2(0)),9.0ln(10.2(1))]=(9.0ln(1),9.0ln(0.8)]=(0,2.01]\begin{alignedat}{1} (\widehat a_0, \widehat a_1] &= (-9.0\ln\left(1 - 0.2(0)\right), -9.0\ln\left(1 - 0.2(1)\right)] \\ &= (-9.0\ln\left(1\right), -9.0\ln\left(0.8\right)] \\ &= (0, 2.01] \end{alignedat}

What's the expected number of observations in each interval? Well, since we made sure to create equal-probability intervals, the expected value for each interval is:

Ei=npi=nkE_i = np_i = \frac{n}{k}

Now let's tally up how many observations fall in each interval, and record our OiO_i's. Consider the following table:

interval (ai1,ai]OiEi=n/k(0,2.01]2520(2.01,4.60]2720(4.60,8.25]2320(8.25,14.48]1320(14.48,)1220100100\begin{array}{c|cc} \text{interval }(a_{i-1}, a_i] & O_i & E_i = n/k \\ \hline (0,2.01] & 25 & 20 \\ (2.01,4.60] & 27 & 20 \\ (4.60, 8.25] & 23 & 20 \\ (8.25,14.48] & 13 & 20 \\ (14.48, \infty) & 12 & 20 \\ \hline & 100 & 100 \end{array}

Given these observations and expectations, we can compute our χ02\chi^2_0 goodness-of-fit statistic:

χ02=i=1k(OiEi)2Ei=9.80\chi^2_0 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = 9.80

Let's choose the appropriate χ2\chi^2 quantile. We'll use α=0.05\alpha = 0.05 and v=k1sv = k - 1 - s. Since s=1s = 1 - remember, we estimated λ\lambda - then v=511=3v = 5 - 1 - 1 = 3. Therefore:

χα,k1s2=χ0.05,32=7.81\chi^2_{\alpha, k-1-s} = \chi^2_{0.05,3} = 7.81

Since our test statistic is greater than our quantile, we must reject H0H_0 and conclude that the observations are not exponential.
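
Here is a sketch of the whole equal-probability-interval recipe end to end. The data are simulated (an assumption; the notes' numbers came from a different sample with \bar X = 9.0), but the steps - estimate \lambda by 1/\bar X, build the \widehat a_i's, tally the O_i's, and compare to the \chi^2 quantile with k - 1 - 1 degrees of freedom - follow the section above.

```python
# Chi-squared goodness-of-fit test for exponential data, equal-probability intervals.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(seed=9)
x = rng.exponential(scale=9.0, size=100)          # assumed interarrival times
n, k, alpha = len(x), 5, 0.05

x_bar = x.mean()
# a_i = -X-bar * ln(1 - i/k) for i = 0,...,k-1, with a_k = infinity
edges = np.append(-x_bar * np.log(1 - np.arange(k) / k), np.inf)
observed, _ = np.histogram(x, bins=edges)
expected = np.full(k, n / k)                      # equal-probability intervals

chi2_0 = ((observed - expected) ** 2 / expected).sum()
critical = chi2.ppf(1 - alpha, df=k - 1 - 1)      # s = 1 since lambda was estimated
print(chi2_0, critical, chi2_0 > critical)
```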

Weibull Example

In this lesson, we'll carry out a goodness-of-fit test for observations supposedly coming from the Weibull distribution. This procedure will take more work than the exponential, but it's more general, as the Weibull generalizes the exponential.

Weibull Example

Let's suppose that we have a series of observations, X1,X2,...,XnX_1, X_2,...,X_n, and we hypothesize that they are coming from a Weibull(rr, λ\lambda) distribution. The Weibull distribution has the following cdf:

F(x)=1e(λx)r,x0F(x) = 1 - e^{-(\lambda x)^r}, \quad x \geq 0

We say that the Weibull generalizes the exponential because, for r=1r=1, F(x)F(x) is the cdf of the exponential distribution:

F(x)=1eλx,x0F(x) = 1 - e^{-\lambda x}, \quad x \geq 0

Like we did with the exponential, we'd like to conduct a χ2\chi^2 goodness-of-fit test with equal-probability intervals. In other words, we will choose interval boundaries, aia_i's, such that:

F(ai)=1e(λai)r=ikF(a_i) = 1 - e^{-(\lambda a_i)^r} = \frac{i}{k}

If the intervals are equal probability, then the probability that an observation falls in any of the k intervals must equal 1/k. Correspondingly, F(x) must increase by 1/k as it sweeps through each interval, until F(a_k) = 1.

Let's now solve for aia_i:

\begin{alignedat}{1} 1 - e^{-(\lambda a_i)^r} &= \frac{i}{k} \\ 1 - \frac{i}{k} &= e^{-(\lambda a_i)^r} \\ -\ln\left[1 - \frac{i}{k}\right] &= (\lambda a_i)^r \\ \left(-\ln\left[1 - \frac{i}{k}\right]\right)^{1/r} &= \lambda a_i \\ \frac{1}{\lambda}\left(-\ln\left[1 - \frac{i}{k}\right]\right)^{1/r} &= a_i, \quad i = 1,2,...,k \end{alignedat}

Since \lambda and r are unknown, we'll have two MLEs, so s=2. Remember that s is the number of estimated parameters, which we subtract from the degrees of freedom of the \chi^2 distribution whose quantile we take during our test.

Let's differentiate the cdf, F(x)F(x), to get the pdf, f(x)f(x):

f(x)=λr(λx)r1e(λx)r,x0f(x) = \lambda r (\lambda x)^{r-1}e^{-(\lambda x)^r}, \quad x \geq 0

From there, we can take the likelihood function for an iid sample of size nn:

L(r,λ)=i=1nf(xi)=λnrrni=1nxir1exp[λri=1nxir]L(r, \lambda) = \prod_{i=1}^n f(x_i) = \lambda^{nr}r^n\prod_{i=1}^n x_i^{r-1} \exp\left[-\lambda^r \sum_{i=1}^n x^r_i\right]

If we take the natural logarithm of the likelihood function, we get:

ln(L)=nln(r)+(r1)ln(i=1nxi)+nrln(λ)λri=1nxir\ln(L) = n\ln(r) + (r-1)\ln\left(\prod^n_{i=1}x_i\right) + nr\ln(\lambda) - \lambda^r \sum_{i=1}^n x_i^r

Now, we have to maximize LL with respect to rr and λ\lambda. To do so, we take the partial derivative of LL with respect to the appropriate parameter, set it to zero, and solve for that parameter. After a bunch of algebra, we get this value for λ\lambda:

\lambda = \left(\frac{n}{\sum_{i=1}^n x_i^r}\right)^{1/r}

Correspondingly, we get the following function of r, whose root gives the MLE, i.e., g(\widehat r) = 0:

g(r)=nr+i=1nln(xi)nixirln(xi)ixirg(r) = \frac{n}{r} + \sum_{i=1}^n \ln(x_i) - \frac{n \sum_i x_i^r \ln(x_i)}{\sum_i x_i^r}

How do we find the zero? Let's try Newton's method. Of course, to use this method, we need to know the derivative of $g(r)$:

$$g'(r) = -\frac{n}{r^2} - \frac{n\sum_i x^r_i[\ln(x_i)]^2}{\sum_i x_i^r} + \frac{n\left[\sum_i x^r_i \ln(x_i)\right]^2}{\left[\sum_i x^r_i\right]^2}$$

Here's a reasonable implementation of Newton's method. Let's initialize $\widehat r_0 = \bar X / S$, where $\bar X$ is the sample mean and $S$ is the sample standard deviation. Then, we iteratively improve our guess for $\widehat r$, using Newton's method:

$$\widehat r_j \leftarrow \widehat r_{j-1} - \frac{g(\widehat r_{j-1})}{g'(\widehat r_{j-1})}$$

If $|g(\widehat r_{j-1})| < 0.001$, then we stop and set the MLE $\widehat r = \widehat r_j$. Otherwise, we continue refining $\widehat r_j$. Once we have $\widehat r$, to which Newton's method converges after only three or four iterations, we can immediately get $\widehat \lambda$:

$$\widehat \lambda = \left(\frac{1}{n}\sum_{i=1}^n x_i^{\widehat r}\right)^{-1/\widehat r}$$
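Here's a rough Python sketch of the whole fitting procedure described above. The function name, tolerance, and iteration cap are illustrative choices rather than anything prescribed by the notes.

```python
import math

def weibull_mle(x, tol=1e-3, max_iter=100):
    """Fit Weibull(r, lambda) to data x by MLE, using Newton's method on g(r)."""
    n = len(x)
    log_x = [math.log(xi) for xi in x]

    def g(r):
        xr = [xi ** r for xi in x]
        s1 = sum(xr)
        s2 = sum(xri * lx for xri, lx in zip(xr, log_x))
        return n / r + sum(log_x) - n * s2 / s1

    def g_prime(r):
        xr = [xi ** r for xi in x]
        s1 = sum(xr)
        s2 = sum(xri * lx for xri, lx in zip(xr, log_x))
        s3 = sum(xri * lx ** 2 for xri, lx in zip(xr, log_x))
        return -n / r ** 2 - n * s3 / s1 + n * s2 ** 2 / s1 ** 2

    # Initialize r_0 = Xbar / S (sample mean over sample standard deviation).
    xbar = sum(x) / n
    s_dev = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    r = xbar / s_dev

    for _ in range(max_iter):
        r_new = r - g(r) / g_prime(r)      # Newton update
        if abs(g(r)) < tol:                # stopping rule from the notes
            r = r_new
            break
        r = r_new

    lam = (sum(xi ** r for xi in x) / n) ** (-1.0 / r)
    return r, lam
```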

Then, by invariance, we finally have the MLEs for the equal-probability interval endpoints:

$$\widehat a_i = \frac{1}{\widehat \lambda} \left[-\ln\left(1 - \frac{i}{k}\right) \right]^{1 / \widehat r}, \quad i = 1,2,\ldots,k$$

Let's suppose we take $n=50$ observations and divide them into $k=8$ equal-probability intervals. Moreover, let's suppose that we calculate that $\widehat r = 0.525$ and $\widehat \lambda = 0.161$. Given these parameters:

$$\widehat a_i = 6.23 \left[-\ln\left(1 - \frac{i}{8}\right) \right]^{1.905}, \quad i = 1,2,\ldots,8$$
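As a quick sanity check on this formula, a few lines of Python reproduce the interval endpoints from $\widehat r$ and $\widehat \lambda$ (up to rounding):

```python
import math

r_hat, lam_hat, k = 0.525, 0.161, 8

# Equal-probability endpoints a_i = (1/lambda) * [-ln(1 - i/k)]^(1/r).
# The last endpoint a_k is infinite, so we only compute i = 1,...,k-1.
a = [(1 / lam_hat) * (-math.log(1 - i / k)) ** (1 / r_hat) for i in range(1, k)]
print([round(ai, 3) for ai in a])   # first few: roughly 0.134, 0.578, ...
```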

Further suppose that we get the following $O_i$'s:

$$\begin{array}{c|cc} (\hat a_{i-1}, \hat a_i] & O_i & E_i = n / k \\ \hline (0,0.134] & 6 & 6.25 \\ (0.134,0.578] & 5 & 6.25 \\ \vdots & \vdots & \vdots \\ (11.54,24.97] & 5 & 6.25 \\ (24.97, \infty) & 6 & 6.25 \\ \hline & 50 & 50 \end{array}$$

Let's compute our $\chi^2$ goodness-of-fit statistic:

$$\chi^2_0 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = 1.20$$

Given $\alpha = 0.05$ and $\nu = 8 - 1 - 2 = 5$, our quantile is:

$$\chi^2_{0.05, 5} = 11.1$$

Since our test statistic is less than our quantile, we fail to reject the null hypothesis and assume that the observations are coming from a Weibull distribution.

Still More Goodness-of-Fit Tests

In this lesson, we'll look at other types of goodness-of-fit tests. In particular, we will look at Kolmogorov-Smirnov, which works well when we have small samples.

Kolmogorov-Smirnov Goodness-of-Fit Test

There are plenty of goodness-of-fit tests that we can use instead of the χ2\chi^2 test. The advantage of the Kolmogorov-Smirnov test (K-S) is that it works well in low-data situations, although we can use it perfectly well when we have ample data, too.

As usual, we'll test the following null hypothesis:

$$H_0: X_1, X_2,\ldots,X_n \overset{\text{iid}}{\sim} \text{some distribution with cdf } F(x)$$

Recall that the empirical cdf, or sample cdf, of a series of observations, $X_1,\ldots,X_n$, is defined as:

$$\hat{F}_n(x) \equiv \frac{\text{number of } X_i\text{'s} \leq x}{n}$$

In other words, the cdf of the sample, evaluated at $x$, is equal to the ratio of observations less than or equal to $x$ to the total number of observations. Remember that $\hat F_n(x)$ is a step function that jumps by $1/n$ at each $X_i$.
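For instance, here's a tiny sketch of the empirical cdf as a direct counting exercise; the sample values are made up.

```python
def ecdf(sample, x):
    """Empirical cdf: fraction of observations less than or equal to x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

# With these five observations, the ecdf jumps by 1/5 at each data point.
data = [0.2, 0.9, 0.4, 1.3, 0.7]
print(ecdf(data, 0.5))   # 0.4, since two of the five observations are <= 0.5
```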

For example, consider the empirical cdf of ten Exp(1) observations - in blue below - on which we've superimposed the Exp(1) cdf, in red:

Notice that every time the empirical cdf encounters an observation, it jumps up by $1/10$, or $10\%$. If we look at the superimposed red line, we can see that the empirical cdf (generated from just ten observations) and the actual cdf fit each other quite well.

Indeed, the Glivenko-Cantelli Lemma says that the empirical cdf converges to the true cdf as the sample size increases: $\hat F_n(x) \to F(x)$ as $n \to \infty$. If $H_0$ is true, then the empirical cdf, $\hat F_n(x)$, should be a good approximation to the true cdf, $F(x)$, for large $n$.

We want to answer the main question: Does the empirical distribution actually support the assumption that $H_0$ is true? If the empirical distribution doesn't closely resemble the distribution that $H_0$ is supposing, we should reject $H_0$.

In particular, the K-S test rejects $H_0$ if the following inequality holds:

$$D \equiv \max_{x \in \mathbb R} |F(x) - \hat F_n(x)| > D_{\alpha, n}$$

In other words, we define our test statistic, $D$, as the maximum deviation between the hypothesized cdf, $F(x)$, and the empirical cdf, $\hat F_n(x)$. Note here that $\alpha$ is our level of significance, and $D_{\alpha, n}$ is a tabled K-S quantile. Interestingly, the value of $D_{\alpha, n}$ depends on the particular distribution we are hypothesizing, in addition to $\alpha$ and $n$.

If the empirical cdf diverges significantly from the supposed true cdf, then $D$ will be large. In that case, we will likely reject the null hypothesis and conclude that the observations are probably not coming from the hypothesized cdf.

K-S Example

Let's test the following null hypothesis:

$$H_0: X_1, X_2,\ldots,X_n \overset{\text{iid}}{\sim} \text{Unif}(0,1)$$

Of course, we've used the $\chi^2$ goodness-of-fit test previously to test uniformity. Also, we probably wouldn't test the uniformity of numbers coming from an RNG using K-S because, in that case, we'd likely have millions of observations, and a $\chi^2$ test would be just fine.

However, let's pretend that these $X_i$'s are expensive to obtain, and therefore we only have a few of them; perhaps they are service times that we observed over the course of a day.

Remember the K-S statistic:

$$D \equiv \max_{x \in \mathbb R} | F(x) - \hat F_n(x)|$$

Now, remember the cdf for the uniform distribution:

$$F(x) = \begin{cases} 0, & x < a \\ \frac{x-a}{b-a}, & a \leq x \leq b \\ 1, & x > b \end{cases}$$

For $a = 0$, $b = 1$:

$$F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \leq x \leq 1 \\ 1, & x > 1 \end{cases}$$

As a result, we can tighten the range of $x$ and substitute in the cdf. Consider:

$$D = \max_{0 \leq x \leq 1} | x - \hat F_n(x)|$$

The maximum can only occur when $x$ equals one of the observations, $X_1, X_2, \ldots,X_n$. At any $X_i$, $\hat F_n(x)$ jumps suddenly from $(i-1)/n$ to $i/n$. Since $x$ doesn't experience a jump itself at this point, the distance between $x$ and $\hat F_n(x)$ is maximized when $x$ equals some $X_i$.

Let's first define the ordered points, $X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$. For example, if $X_1 = 4$, $X_2 = 1$, and $X_3 = 6$, then $X_{(1)} = 1$, $X_{(2)} = 4$, and $X_{(3)} = 6$.

Instead of computing the maximum over all the $x$-values between zero and one, we only have to compute the maximum taken at the jump points. We can calculate the two potential maximum values:

$$D^+ \equiv \max_{1 \leq i \leq n}\left\{\frac{i}{n} - X_{(i)}\right\}, \quad D^- \equiv \max_{1 \leq i \leq n}\left\{X_{(i)} - \frac{i-1}{n}\right\}$$

Given those two values, it turns out that $D = \max(D^+, D^-)$.

Let's look at a numerical example. Consider:

$$\begin{array}{c|ccccc} X_i & 0.039 & 0.706 & 0.016 & 0.198 & 0.793 \\ X_{(i)} & 0.016 & 0.039 & 0.198 & 0.706 & 0.793 \end{array}$$

Let's calculate the $D^+$ and $D^-$ components:

$$\begin{array}{c|ccccc} X_i & 0.039 & 0.706 & 0.016 & 0.198 & 0.793 \\ X_{(i)} & 0.016 & 0.039 & 0.198 & 0.706 & 0.793 \\ \hline \frac{i}{n} & 0.2 & 0.4 & 0.6 & 0.8 & 1.0 \\ \frac{i-1}{n} & 0 & 0.2 & 0.4 & 0.6 & 0.8 \\ \frac{i}{n} - X_{(i)} & 0.184 & 0.361 & \textbf{0.402} & 0.094 & 0.207 \\ X_{(i)} - \frac{i-1}{n} & 0.016 & - & - & \textbf{0.106} & - \end{array}$$

As we can see from the bolded cells, $D^+ = 0.402$ and $D^- = 0.106$, so $D = 0.402$. Now, we can go to a K-S table for the uniform distribution. With $n=5$ and $\alpha = 0.05$, the tabled quantile is $D_{0.05, 5} = 0.565$. Because the statistic is less than the quantile, we fail to reject uniformity.
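This bookkeeping is easy to script. The short sketch below reproduces $D = 0.402$ for the five observations above; the function name is just illustrative.

```python
def ks_uniform(sample):
    """K-S statistic D for H0: sample ~ Unif(0,1), via D+ and D-."""
    x = sorted(sample)                                   # order statistics X_(i)
    n = len(x)
    d_plus = max((i + 1) / n - xi for i, xi in enumerate(x))
    d_minus = max(xi - i / n for i, xi in enumerate(x))
    return max(d_plus, d_minus)

data = [0.039, 0.706, 0.016, 0.198, 0.793]
print(round(ks_uniform(data), 3))   # 0.402; compare against the tabled D_{0.05, 5} = 0.565
```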

That being said, we encountered several small numbers in our sample, leading us to perhaps intuitively feel as though these $X_i$'s are not uniform. One of the properties of the K-S test is that it is quite conservative: it needs a lot of evidence to reject $H_0$.

Other Tests

There are many other goodness-of-fit tests, such as:

  • Anderson-Darling
  • Cramér-von Mises
  • Shapiro-Wilk (especially appropriate for testing normality)

We can also use graphical techniques, such as Q-Q plots, to evaluate normality. If the observations don't fall approximately on a $y = x$ line on a Q-Q plot, we have to question normality.
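As an illustrative sketch, scipy's probplot can draw a normal Q-Q plot; the sample below is synthetic, and matplotlib is assumed to be installed.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=30, scale=8, size=200)   # illustrative data

# Plot sample quantiles against theoretical normal quantiles; roughly
# linear points are consistent with normality.
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```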

Problem Children

In this lesson, we will talk about performing input analysis under problematic circumstances.

Problem Children

We might think that we can always find a good distribution to fit our data. This assumption is not exactly true, and there are some situations where we have to be careful. For example, we might have little to no data or data that doesn't look like one of our usual distributions. We could also be working with nonstationary data; that is, data whose distribution changes over time. Finally, we might be dealing with multivariate or correlated data.

No / Little Data

Believe it or not, this issue turns up all the time. For example, at the beginning of every project, there simply are no observations available. Additionally, even if we have data, the data might not be useful when we receive it: it might contain unrealistic, or flat-out wrong, values, or it might not have been cleaned properly. As a concrete example, we can imagine receiving airport data that shows planes arriving before they depart.

What do we do in these situations? We can interview so-called "domain experts" and try to get at least the minimum, maximum, and "most likely" values from them so that we can guess at uniform or triangular distributions. If the experts can provide us with quantiles - what value we should expect 95% of the observations to fall below, for example - that's even better. At the very least, we can discuss the nature of the observations with the expert and try to extract some information that will allow us to make a good guess at the distribution.

If we have some idea about the nature of the random variables, perhaps we can start to make good guesses at the distribution. For example, if we know that the data is continuous, we know that a geometric distribution doesn't make sense. If we know that the observations have to do with arrival times, then we will treat them differently than if they are success/failure binaries.

Do the observations adhere to Poisson assumptions? If so, then we are looking at the Poisson distribution, if we are counting arrivals, or the exponential distribution, if we are working with interarrival times. Are the observations averages or sums? We might be able to use the central limit theorem and consider the normal distribution. If the observations are bounded, we might consider the beta distribution, which generalizes the uniform and triangular. We might consider the gamma, Weibull, or lognormal distributions if we are working with reliability data or job time data.

Perhaps we can understand something about the physical characteristics underlying the random variable. For example, the sizes of particulate matter after an explosion tend to follow a lognormal distribution. We also might know that the price of a stock could follow the lognormal distribution.

Goofy Distributions

Let's consider a poorly designed exam with two modes: students either did quite well or very poorly. We can represent this distribution using some combination of two normal random variables, but most commercial software packages cannot fit distributions like this. For example, consider Minitab fitting a normal to the data below.

We can attempt to model such a distribution as a mixture of other reasonable distributions. Even easier, we can sample directly from the empirical distribution, or perhaps a smoothed version thereof, which is called bootstrapping.
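For instance, here's a minimal bootstrapping sketch: we resample, with replacement, directly from the empirical data. The exam scores below are made up.

```python
import random

def bootstrap_sample(data, k):
    """Draw k observations, with replacement, from the empirical distribution."""
    return random.choices(data, k=k)

# Hypothetical bimodal exam scores; resampling reproduces the two modes
# without ever fitting a parametric distribution.
scores = [22, 25, 31, 28, 24, 88, 91, 85, 93, 90]
print(bootstrap_sample(scores, 5))
```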

Nonstationary Data

Arrival rates change over time. For example, consider restaurant occupancy, traffic on the highway, call center activity, and seasonal demand for products. We have to take this variability into account, or else we will get garbage-in-garbage-out.

One strategy we might use is to model the arrivals as a nonhomogeneous Poisson process, and we explored NHPPs back when we discussed random variate generation. Of course, we have to model the rate function properly. Arena uses a piecewise-constant rate function, which remains constant within an interval, but can jump up or down in between intervals.
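For instance, here's a rough sketch of generating NHPP arrivals by thinning against a piecewise-constant rate function. The rates, interval boundaries, and function names are made-up illustrations, not Arena's internals.

```python
import random

# Hypothetical piecewise-constant arrival rates (per hour) over a 3-hour window.
breaks = [0.0, 1.0, 2.0, 3.0]
rates = [2.0, 10.0, 4.0]          # rate on [0,1), [1,2), [2,3)

def rate(t):
    for left, right, lam in zip(breaks, breaks[1:], rates):
        if left <= t < right:
            return lam
    return 0.0

def nhpp_arrivals(t_end, seed=None):
    """Generate NHPP arrival times on [0, t_end) by thinning."""
    rng = random.Random(seed)
    lam_max = max(rates)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(lam_max)            # candidate arrival at rate lam_max
        if t >= t_end:
            return arrivals
        if rng.random() <= rate(t) / lam_max:    # accept with prob lambda(t)/lam_max
            arrivals.append(t)

print(nhpp_arrivals(3.0, seed=1)[:5])
```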

Multivariate / Correlated Data

Of course, data don't have to be iid! Data is often multivariate. For example, a person's height and weight are correlated. When modeling people, we have to ensure that we generate the correlated height and weight simultaneously, lest we generate a seven-foot-tall person who weighs ninety pounds.

Data can also be serially correlated. For example, monthly unemployment rates are correlated: the unemployment rate next month is correlated with the rate this month (and likely the past several months). As another example, arrivals to a social media site might be correlated if something interesting gets posted, and the public hears about it. As a third example, a badly damaged part may require more service than usual at a series of stations, so the consecutive service times are likely correlated. Finally, if a server gets tired, his service times may be longer than usual.

What can we do? First, we need to identify situations in which data is multivariate or serially correlated, and we can conduct various statistical tests to surface these relationships. From there, we can propose appropriate models. For example, we can use the multivariate normal distribution if we are looking at heights and weights. We can also use time series models for serially correlated observations, such as the ARMA(p,q), EAR(1), and ARP processes we discussed previously.
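As a quick illustration of the multivariate normal idea, here's a sketch that generates correlated height/weight pairs; the means, standard deviations, and correlation are invented numbers, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: height in inches, weight in pounds, correlation 0.7.
mean = [69.0, 190.0]
sd = np.array([3.0, 25.0])
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])
cov = np.outer(sd, sd) * corr     # covariance matrix from sds and correlation

heights, weights = rng.multivariate_normal(mean, cov, size=1000).T
print(np.corrcoef(heights, weights)[0, 1])   # should land near 0.7
```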

Even if we guess at the model successfully, we still need to estimate the relevant parameters. For example, in the multivariate normal distribution, we have to estimate the marginal means and variances, as well as the covariances. Some parameters are easier to estimate than others. For a simple time series, like the AR(1), we only have to estimate the coefficient, $\phi$. For more complicated models, like ARMA($p,q$) processes where $p$ and $q$ are greater than one, we have to use software that implements Box-Jenkins methodology.

After we've guessed at the distribution and estimated the relevant parameters, we finally have to validate our estimated model to see if it is any good. If we run a simulation and see something that looks drastically different from the original process, we have to reevaluate.

Alternatively, we can bootstrap samples from an empirical distribution if we are fortunate enough to have a lot of data.

Demo Time

In this lesson, we will demonstrate how we carry out an elementary input analysis using Arena.

Software Interlude

Arena has functionality that automatically fits simple distributions to our data, which we can access from Tools > Input Analyzer. This input analyzer purportedly gives us the best distribution from its library, along with relevant sample and goodness-of-fit statistics.

ExpertFit is a specialty product that performs distribution fitting against a much larger library of distributions. Both Minitab and R have some distribution fitting functionality, but they are not as convenient as Arena and ExpertFit.

The only drawback to some of these programs is that they have issues dealing with the problem children we discussed previously.

Demo Time

Let's hop over to Arena. First, we click Tools > Input Analyzer. Next, we select File > New and then File > Data File > Use Existing... to get started. From there, we can select the relevant .dst file. Let's click on normal.dst and take a look at the data.

We can see that Arena gives us several pieces of information about our data, such as:

  • number of observations (5000)
  • minimum value (0.284)
  • maximum value (7.29)
  • sample mean (3.99)
  • sample standard deviation (1)

We can try to fit various distributions by selecting from the Fit menu. Let's select Fit > Triangular and see what we get.

This distribution tries to fit the minimum, maximum, and modal values of the data, and, as we can see, it's a terrible fit. Arena believes that the best fit is a TRIA(0,3.99,7.99) distribution. The $\chi^2$ test statistic for this fit is 1680, and the corresponding p-value is less than 0.005. The K-S statistic is 0.134 and has a corresponding p-value of less than 0.01. In plain English, the data is not triangular.

Let's try again. We can select Fit > Weibull and see what we get.

Arena believes that the best fit is a WEIB(4.33, 4.3) distribution. The $\chi^2$ test statistic for this fit is 66.2, and the corresponding p-value is less than 0.005. The K-S statistic is 0.0294 and has a corresponding p-value of less than 0.01. Again, even though the fit "looks" decent, we would still reject this data as being Weibull per the numbers.

Let's try again. We can select Fit > Erlang and see what we get.

Arena believes that the best fit is an ERLA(0.285, 14) distribution. The $\chi^2$ test statistic for this fit is 248, and the corresponding p-value is less than 0.005. The K-S statistic is 0.0414 and has a corresponding p-value of less than 0.01. The Erlang distribution does not fit this data.

Let's try again. We can select Fit > Normal and see what we get.

Arena believes that the best fit is a NORM(3.99, 1) distribution. The $\chi^2$ test statistic for this fit is 22.1, and the corresponding p-value is 0.732. The K-S statistic is 0.00789 and has a corresponding p-value greater than 0.15. We would fail to reject the null hypothesis in this case and conclude that this data is normal.

As a shortcut, we can select Fit > Fit All, and Arena will choose the best distribution from its entire library. If we select this option, Arena chooses the exact same normal distribution.

Let's look at some new data. We can click File > Data File > Use Existing... and then select the lognormal.dst file. Let's select Fit > Beta and see what we get.

Arena believes that the best fit is a shifted, widened BETA(3.34, 8.82) distribution. The $\chi^2$ test statistic for this fit is 126, and the corresponding p-value is less than 0.005. The K-S statistic is 0.0239 and has a corresponding p-value of less than 0.01. The beta distribution is not a good fit.

Let's try again. We can select Fit > Exponential and see what we get - what a joke.

Let's try again. We can select Fit > Weibull and see what we get.

Arena believes that the best fit is a shifted WEIB(2.49, 2.25) distribution. The $\chi^2$ test statistic for this fit is 237, and the corresponding p-value is less than 0.005. The K-S statistic is 0.0393 and has a corresponding p-value of less than 0.01. Clearly, Weibull is out.

Let's try again. We can select Fit > Lognormal and see what we get.

Arena believes that the best fit is a shifted LOGN(2.22, 1.14) distribution. The $\chi^2$ test statistic for this fit is 257, and the corresponding p-value is less than 0.005. The K-S statistic is 0.0458 and has a corresponding p-value of less than 0.01. Surprisingly, we would reject these observations as being lognormal, even though we know we sampled them from that distribution.

Let's select Fit > Fit All to see what Arena thinks is the best fit.

Arena believes that the best overall fit is a shifted ERLA(0.44, 5) distribution. The $\chi^2$ test statistic for this fit is 42.3, and the corresponding p-value is 0.0182. The K-S statistic is 0.0122 and has a corresponding p-value greater than 0.15. Depending on our significance level, we may reject or fail to reject this data as coming from the Erlang distribution.

Let's look at a final example. We can click File > Data File > Use Existing... and then select the partbprp.dst file. Note that we don't know ahead of time what distribution this data comes from.

Let's select Fit > Fit All to see what Arena thinks is the best fit.

Arena believes that the best overall fit is a shifted GAMM(0.775, 4.29) distribution. The $\chi^2$ test statistic for this fit is 4.68, and the corresponding p-value is 0.337. The K-S statistic is 0.0727 and has a corresponding p-value greater than 0.15. We would fail to reject that this data is coming from the gamma distribution.

If we click on Data File > Generate New..., we can generate a file of observations according to some distribution. Let's write 5000 TRIA(2,5,10) observations to a file called triang2.dst.

Let's load these observations and then select Fit > Fit All.

Arena believes that the best overall fit is a TRIA(2,5,10) distribution. The $\chi^2$ test statistic for this fit is 33.6, and the corresponding p-value is 0.565. The K-S statistic is 0.00932 and has a corresponding p-value greater than 0.15. Perfect.
