Comparing Systems, Continued

30 minute read

Notice a tyop typo? Please submit an issue or open a PR.

Comparing Systems, Continued

Normal Means Selection

In this lesson, we will look at the classic problem of choosing the normal distribution having the largest mean. Furthermore, we'll guarantee a certain probability of correct selection.

Find the Normal Distribution with the Largest Mean

Let's start with some underlying assumptions. We will assume that we can take independent observations, Yi1,Yi2,...,Yin,(1≀i≀k)Y_{i1}, Y_{i2},..., Y_{in}, (1 \leq i \leq k), from kβ‰₯2k \geq 2 normal populations ∏1,...,∏k\prod_1,...,\prod_k. Here, ∏i\prod_i refers to a normal distribution with unknown mean ΞΌi\mu_i and known or unknown variance Οƒi2\sigma_i^2.

Let's denote the vector of means as ΞΌ=(ΞΌ1,...,ΞΌk)\boldsymbol \mu = (\mu_1,...,\mu_k) and the vector of variances as Οƒ2=(Οƒ12,...,Οƒk2)\boldsymbol \sigma^2 = (\sigma^2_1,...,\sigma^2_k). Furthermore, let's denote the ordered (but unknown) ΞΌi\mu_i's as ΞΌ[1]≀⋯≀μ[k]\mu_{[1]} \leq \cdots \leq \mu_{[k]}.

We will refer to the "best" system as that which has the largest mean, ΞΌ[k]\mu_{[k]}. Our goal is to select the population - strategy, treatment, alternative, etc. - associated with mean ΞΌ[k]\mu_{[k]}. We make a correct selection (CS) if we achieve our goal.

Indifference-Zone Probability Requirement

For specified constants, Pβˆ—P^* and Ξ΄βˆ—\delta^*, where Ξ΄βˆ—>0\delta^* > 0 and 1/k<Pβˆ—<11/k < P^* < 1 we require:

P(CS)β‰₯Pβˆ—wheneverΞΌ[k]βˆ’ΞΌ[kβˆ’1]β‰₯Ξ΄βˆ—,(1)P(\text{CS}) \geq P^* \quad \text{whenever} \quad \mu_{[k]} - \mu_{[k - 1]} \geq \delta^*, \quad (1)

We want the probability of correct selection to be at least Pβˆ—P^*, which we specify according to our sensitivity. Moreover, Pβˆ—P^* is in the range (1/k,1)(1/k, 1). The lower bound is 1/k1/k, not zero, because 1/k1/k is the probability of correct selection given a random guess. Of course, PP cannot equal one unless we take an infinite number of samples.

On the other hand, Ξ΄βˆ—\delta^* merely has to be greater than zero, because its a bound on ΞΌ[k]βˆ’ΞΌ[kβˆ’1]\mu_{[k]} - \mu_{[k-1]}, and ΞΌ[k]>ΞΌ[kβˆ’1]\mu_{[k]} > \mu_{[k-1]}. Remember, we also specify Ξ΄βˆ—\delta^*.

We can think about Ξ΄βˆ—\delta^* as the smallest difference worth detecting. When we specify a small Ξ΄βˆ—\delta^*, we are saying that we want to catch very small differences between ΞΌ[k]\mu_{[k]} and ΞΌ[kβˆ’1]\mu_{[k-1]}. As we increase Ξ΄βˆ—\delta^*, we increase the difference between ΞΌ[k]\mu_{[k]} and ΞΌ[kβˆ’1]\mu_{[k-1]} that we are willing to tolerate.

As long as the difference between the two values is less than Ξ΄βˆ—\delta^*, we don't care which we pick. Of course, if the difference between ΞΌ[k]\mu_{[k]} and ΞΌ[kβˆ’1]\mu_{[k-1]} is bigger than Ξ΄βˆ—\delta^*, we want to make sure we select the right population.

Obviously, the probability of correct selection depends on the differences ΞΌiβˆ’ΞΌj\mu_i - \mu_j, the sample size, nn, and Οƒ2\boldsymbol \sigma^2.

Parameter configurations ΞΌ\boldsymbol \mu satisfying the constraint ΞΌ[k]βˆ’ΞΌ[kβˆ’1]β‰₯Ξ΄βˆ—\mu_{[k]} - \mu_{[k-1]} \geq \delta^* are in the preference zone for correct selection. For example, if ΞΌ[k]=5\mu_{[k]} = 5, ΞΌ[kβˆ’1]=3\mu_{[k-1]} = 3, and Ξ΄βˆ—=1\delta^* = 1, then the ΞΌ\boldsymbol \mu containing this ΞΌ[k]\mu_{[k]} and ΞΌ[kβˆ’1]\mu_{[k-1]} would be in the preference zone.

If ΞΌ\boldsymbol \mu falls in the preference zone, we want to ensure that we make the correct selection.

Parameter configurations ΞΌ\boldsymbol \mu not satisfying the constraint ΞΌ[k]βˆ’ΞΌ[kβˆ’1]β‰₯Ξ΄βˆ—\mu_{[k]} - \mu_{[k-1]} \geq \delta^* are in the indifference zone. For example, if ΞΌ[k]=5\mu_{[k]} = 5, ΞΌ[kβˆ’1]=3\mu_{[k-1]} = 3, and Ξ΄βˆ—=3\delta^* = 3, then the ΞΌ\boldsymbol \mu containing this ΞΌ[k]\mu_{[k]} and ΞΌ[kβˆ’1]\mu_{[k-1]} would be in the indifference zone.

If ΞΌ\boldsymbol \mu falls in the indifference zone, it doesn't matter too much if we choose ΞΌ[kβˆ’1]\mu_{[k-1]} over ΞΌ[k]\mu_{[k]}, because the two values are within the threshold we've set.

Any procedure that guarantees equation (1)(1) above is said to be employing the indifference zone approach.

So Many Procedures

There are hundreds of different procedures we can use that implement this approach. Of note are the following:

We'll look at the basics of the single-stage procedure in the next lesson. As a word of caution, we wouldn't use this procedure in the wild. It relies on common, known variances among the competing populations, and we rarely know variances when running experiments.

Single-Stage Normal Means Procedure

In this lesson, we'll look at a specific, simple single-state procedure. The purpose of this light introduction is to give a taste of the kinds of procedures available for these types of general selection problems. As we said previously, this procedure assumes that all of the competing normal distributions have common, known variances, which is almost always an unrealistic assumption. Better procedures are out there.

Single-Stage Procedure

The single-stage procedure, NB\mathcal{N}_\text{B}, from Bechhofer, takes all necessary observations and makes a selection decision at once. It assumes that the populations have common, known variance, which is a fairly restrictive assumption.

For kk competing populations, with specified Pβˆ—P^* and Ξ΄βˆ—/Οƒ\delta^* / \sigma, we determine a sample size nn, usually from a table, such that our probability requirement is satisfied. Remember that Pβˆ—P^* and Ξ΄βˆ—\delta^* are the previously discussed parameters, and Οƒ\sigma is the known standard deviation.

Once we have nn, we will take a random sample of nn observations Yij,(1≀j≀n)Y_{ij}, \enspace (1 \leq j \leq n) in a single stage from ∏i,(1≀i≀k)\prod_i, \enspace (1 \leq i \leq k).

We then calculate the kk sample means:

YΛ‰i=βˆ‘j=1nYij/n,(1≀i≀k)\bar Y_i = \sum_{j=1}^n Y_{ij}/n, \quad (1 \leq i \leq k)

We will select the population that yields the largest sample mean, YΛ‰[k]=max⁑{YΛ‰1,...,YΛ‰k}\bar Y_{[k]} = \max\{\bar Y_1,..., \bar Y_k\} as the one associated with ΞΌ[k]\mu_{[k]}.

The procedure is very intuitive. All that remains is to figure out nn. We can use a table or a multivariate normal quantile, or we can compute nn via a separate simulation if all else fails.

Let's take a look at the table:

We can see the various parameters along the left and top axes, and the required value for nn in the corresponding cell. For instance, if we have k=3k=3 competitors, and we want the probability of correct selection to be Pβˆ—=0.9P^* = 0.9, and we want Ξ΄βˆ—/Οƒ=0.5\delta^* / \sigma = 0.5, then we'll take n=20n=20 observations.

If we don't want to use the table, we can directly calculate nn using a multivariate normal quantile:

n=⌈2(ΟƒZkβˆ’1,1/2(1βˆ’Pβˆ—)/Ξ΄βˆ—)2βŒ‰n = \left\lceil 2\left(\sigma Z^{(1 - P^*)}_{k-1,1/2} / \delta^* \right)^2 \right\rceil

Let's remember our probability requirement:

P(CS)β‰₯Pβˆ—wheneverΞΌ[k]βˆ’ΞΌ[kβˆ’1]β‰₯Ξ΄βˆ—P(\text{CS}) \geq P^* \quad \text{whenever} \quad \mu_{[k]} - \mu_{[k - 1]} \geq \delta^*

The value of nn satisfies the above requirement for any vector of unknown means, ΞΌ\boldsymbol \mu, having the following configuration:

ΞΌ[1]=ΞΌ[kβˆ’1]=ΞΌ[k]βˆ’Ξ΄βˆ—\mu_{[1]} = \mu_{[k-1]} = \mu_{[k]} - \delta^*

This configuration is known as the slippage, or least-favorable, configuration. All of the inferior means are the same, and the best mean is Ξ΄βˆ—\delta^* better than the rest. This configuration is least-favorable because, for fixed nn, it minimizes P(CS)P(\text{CS}) among all ΞΌ\boldsymbol \mu in the preference zone.


Let's look at a proof of how we calculate nn. Let's start by defining Pβˆ—P^* as the probability of correct selection given that we are in the least-favorable configuration:

Pβˆ—=P(CS∣LF)P^* = P(\text{CS}|\text{LF})

The correct selection is that population that has the true best mean. We can only make the correct selection if the sample mean corresponding to the true best mean is bigger than all the other sample means. Assuming the true best population is the kkth one:

Pβˆ—=P{YΛ‰i<YΛ‰k,i=1,...,kβˆ’1∣LF}P^* = P\{\bar Y_i < \bar Y_k, i = 1,...,k-1|\text{LF}\}

If we select population kk, we better make sure that its sample mean is larger than the sample means from all other competing populations.

Let's standardize the right-hand side of the inequality by subtracting its true mean, ΞΌk\mu_k, whatever that may be, and dividing by its variance, Οƒ2/n\sigma^2 / n. Of course, we have to apply the same transformation to the left-hand side:

Pβˆ—=P{YΛ‰iβˆ’ΞΌkΟƒ2/n<YΛ‰kβˆ’ΞΌkΟƒ2/n,i=1,...,kβˆ’1∣LF}P^* = P\left\{\frac{\bar Y_i - \mu_k}{\sqrt{\sigma^2 / n}} < \frac{\bar Y_k - \mu_k}{\sqrt{\sigma^2 / n}}, i = 1,...,k-1 \Big| \text{LF} \right\}

We know that the right-hand side of the inequality is a standard normal random variable. Let's call that xx now. We can condition it out by integrating over all xx. Notice that we have to include the standard normal pdf, Ο•(x)\phi(x), in this integration:

Pβˆ—=∫RP{YΛ‰iβˆ’ΞΌkΟƒ2/n<x,i=1,...,kβˆ’1∣LF}Ο•(x)dxP^* = \int_{\mathbb{R}} P\left\{\frac{\bar Y_i - \mu_k}{\sqrt{\sigma^2 / n}} < x, i = 1,...,k-1 \Big| \text{LF} \right\}\phi(x) dx

Now, let's standardize the right-hand side of the inequality. We can't just subtract by ΞΌi\mu_i since we are already subtracting by ΞΌk\mu_k, so we have to add an adjustment factor on the right-hand side:

Pβˆ—=∫RP{YΛ‰iβˆ’ΞΌiΟƒ2/n<x+nΞ΄βˆ—Οƒ,i=1,...,kβˆ’1}Ο•(x)dxP^* = \int_{\mathbb{R}} P\left\{\frac{\bar Y_i - \mu_i}{\sqrt{\sigma^2 / n}} < x + \frac{\sqrt{n}\delta^*}{\sigma}, i = 1,...,k-1 \right\}\phi(x) dx

Now we are dealing with the cdf of a standard normal random variable, by definition. Therefore:

Pβˆ—=∫RΞ¦kβˆ’1(x+nΞ΄βˆ—Οƒ)Ο•(x)dxP^* = \int_{\mathbb{R}} \Phi^{k-1}\left(x + \frac{\sqrt{n}\delta^*}{\sigma}\right)\phi(x) dx

Since the observations are independent, and we have kβˆ’1k-1 of them, we have to multiply the expression together kβˆ’1k-1 times. That's why we have the Ξ¦kβˆ’1\Phi^{k-1} notation.

If we set h=(n+Ξ΄βˆ—)/Οƒh = (\sqrt{n} + \delta^*) / \sigma, then:

Pβˆ—=∫RΞ¦kβˆ’1(x+h)Ο•(x)dxP^* = \int_{\mathbb{R}} \Phi^{k-1}(x+h)\phi(x) dx

Finally, we solve numerically for hh and set n=⌈(hΟƒ/Ξ΄βˆ—)2βŒ‰n = \lceil (h\sigma / \delta^*)^2\rceil.


Suppose k=4k = 4, and we want to detect a difference in means as small as 0.20.2 standard deviations with P(CS)β‰₯0.99P(\text{CS}) \geq 0.99. The table for NB\mathcal{N}_\text{B} calls for n=361n = 361 observations per population.

If, after taking n=361n = 361 observations, we find that Yˉ1=13.2\bar Y_1 = 13.2, Yˉ2=9.8\bar Y_2 = 9.8, Yˉ3=16.1\bar Y_3 = 16.1, and Yˉ4=12.1\bar Y_4 = 12.1, then we select population three as the best. Since population three has the largest sample mean, we state that it has the largest true mean.

Note that increasing Ξ΄βˆ—\delta^* or decreasing Pβˆ—P^* requires a smaller nn. For example, when Ξ΄βˆ—/Οƒ=0.6\delta^* / \sigma = 0.6 and P=0.95P = 0.95, then NB\mathcal{N}_{\text{B}} only requires n=24n=24 observations per population.

Normal Means Extensions (OPTIONAL)

In this lesson, we'll look at some extensions of Bechhofer's single-stage procedure. As we saw, Bechhofer's procedure is rather limited and only applies to rare situations in which the systems share a common, known variance.

Good News

Bechhofer's single-stage procedure is very easy and intuitive. As we discussed, we look up in a table the number of observations, nn, that we need to take for each of kk competing systems based on parameters Pβˆ—P^* and Ξ΄βˆ—/Οƒ\delta^* / \sigma. After taking nn observations, we select the population with the largest sample mean. With a probability of at least Pβˆ—P^*, that population will have the largest true mean.

As it turns out, this procedure is robust against certain assumption violations. For example, if we are sampling exponential data instead of normal data, we might still be able to use this procedure. However, if the observations are dependent, this procedure does not work well.

Bad News

Unfortunately, this procedure does have some drawbacks. For example, the assumption that the variances are common and known almost renders the procedure unusable in the real world since we rarely, if ever, know the variances of the populations we are studying.

Furthermore, this procedure is conservative in that we typically require too many observations to do the job. The procedure is designed to work for the least-favorable configuration. Sometimes, one population is so superior that we are in better shape than the least favorable configuration. In these "more-favorable" configurations, we could achieve Pβˆ—P^* with fewer samples than the table specifies.

The procedure also requires the competing populations to be independent. This requirement is not necessarily valid when we have simulation data, so we have to be careful.

What to Do?

Any real-life problem will most likely involve unknown and unequal variances among the populations. Thankfully, there are procedures that we can use.

We might use a technique like Rinott's two-stage procedure. This procedure estimates the variances of the populations in the first stage of sampling. This stage of observations determines how many observations are required in stage two to get the desired probability of correct selection.

Kim and Nelson's sequential procedure is probably the most-used contemporary procedure. In stage one, the procedure estimates variances, just like Rinott's does. Then, the procedure samples from the competing populations in multiple stages and eliminates those that aren't performing highly as sampling proceeds.

These procedures obviously require iid normal observations within each system, so we likely cannot apply them to simulation data. For example, random variables such as consecutive waiting times are correlated and nonnormal.

Instead of consecutive waiting times, we can take batches of waiting times and work with the batch means, which are approximately normal and independent for sufficiently large batch size.

We usually require that competing systems be independent. However, various procedures exist that allow us to work with correlated systems. We can leverage these procedures in simulations where we can induce a positive correlation between systems. If handled properly, correlation can help determine which competing system is the best.

Bernoulli Probability Selection

In this lesson, we'll talk about selecting the Bernoulli population with the largest success parameter.

Bernoulli Selection - Introduction

Our goal is to select the Bernoulli population with the largest success parameter from several competing systems. For example, we might want to select:

  • the most effective anti-cancer drug
  • the simulated system that is most likely to meet a design specification
  • the (s,S) inventory policy with the highest profit probability

There are hundreds of such procedures for selecting the correct Bernoulli population. Of note are:

We have kk competing Bernoulli populations with unknown success parameters p1,p2,...,pkp_1, p_2,...,p_k. We denote the ordered pp's by p[1]≀p[2]≀⋯≀p[k]p_{[1]} \leq p_{[2]} \leq \cdots \leq p_{[k]}. Our goal is to select the population having the largest probability p[k]p_{[k]}.

As in previous lessons, we will again specify a formal probability requirement. For specified constants, (Pβˆ—,Ξ”βˆ—)(P^*, \Delta^*), where Ξ”βˆ—>0\Delta^* > 0 and 1/k<Pβˆ—<11/k < P^* < 1, we require:

P(CS)β‰₯Pβˆ—wheneverΞΌ[k]βˆ’ΞΌ[kβˆ’1]β‰₯Ξ”βˆ—,P(\text{CS}) \geq P^* \quad \text{whenever} \quad \mu_{[k]} - \mu_{[k - 1]} \geq \Delta^*, \quad

This requirement states that whenever there is a large difference between the best and second-best system - greater than or equal to Ξ”βˆ—\Delta^* - we want to ensure that we make the right choice, with probability Pβˆ—P^*.

A Single-Stage Procedure

Let's look at the single-stage procedure, BSH\mathcal{B}_{\text{SH}}, from Sobel and Huyett.

For specified (Pβˆ—,Ξ”βˆ—)(P^*, \Delta^*), we find nn in the table. Here's the table:

Next, we take a sample of nn observations, Xij(1≀j≀n)X_{ij} (1 \leq j \leq n), in a single stage from each population (1≀i≀k)(1 \leq i \leq k). From there, we calculate the kk sample sums:

Yin=βˆ‘j=1nXijY_{in} = \sum_{j=1}^n X_{ij}

In plain English, YinY_{in} represents the number of successes from the iith population under study. We then select the treatment that yielded the largest YinY_{in} as the one associated with p[k]p_{[k]}. If there happens to be a tie, we select a random winner among the ties.


Suppose we want to select the best of k=4k = 4 treatments. We want to be correct with probability of at least Pβˆ—=0.95P^* = 0.95 whenever p[4]βˆ’p[3]β‰₯0.10p_{[4]} - p_{[3]} \geq 0.10. The table shows that we need n=212n = 212 observations from each population to meet this requirement.

Suppose that, at the end of sampling, we have the following success counts:

Y1,212=70Y2,212=145Y3,212=95Y4,212=102Y_{1,212} = 70 \quad Y_{2,212} = 145 \quad Y_{3,212} = 95 \quad Y_{4,212} = 102

Then we select population two as best.

Bernoulli Extensions (OPTIONAL)

In this lesson, we will look at extensions of the Bernoulli procedure from the previous lesson. These extensions will save us many observations.


Let's look at a single-stage extension from Bechhofer and Kulkarni called curtailment. Here, we carry out the single-stage procedure, except we stop sampling when the population in second place can at best tie. It doesn't make sense to take any more observations after a winner is guaranteed.

As it turns out, curtailment gives the same probability of correct selection, P(CS)P(\text{CS}), but with a lower expected number of observations, x≀nx \leq n.

For example, remember that, when k=4k=4, Pβˆ—=0.95P^*=0.95, and Ξ”βˆ—=0.10\Delta^* = 0.10, the single-stage procedure requires us to take n=212n = 212 observations from each population, for a total of N=848N = 848 observations.

Suppose that, after taking 180 samples from each population, we have the intermediate result Y1,180=50Y_{1,180} = 50, Y2,180=130Y_{2,180} = 130, Y3,180=74Y_{3,180} = 74, and Y4,180=97Y_{4,180} = 97.

We can stop sampling right now and select population two as the best because it's not possible for population four to catch up in the remaining 212βˆ’180=32212 - 180 = 32 observations. At best, Y4,212=129<Y2,180≀Y2,212Y_{4,212} = 129 < Y_{2,180} \leq Y_{2,212}.

A Sequential Procedure

An even better procedure is the sequential procedure, BBKS\mathcal B_{\text{BKS}}, by Bechhofer, Kiefer, and Sobel from 1968.

This procedure involves a new probability requirement. For specified (Pβˆ—,ΞΈβˆ—)(P^*, \theta^*), with 1/k<Pβˆ—<11/k < P^* < 1 and ΞΈβˆ—>1\theta^* > 1, we require P(CS)β‰₯Pβˆ—P(\text{CS}) \geq P^* whenever the odds ratio:

p[k]/(1βˆ’p[k])p[kβˆ’1]/(1βˆ’p[kβˆ’1])β‰₯ΞΈβˆ—\frac{p_{[k]} / (1 - p_{[k]})}{p_{[k-1]} / (1 - p_{[k-1]})} \geq \theta^*

Here, we are taking the ratio of the odds that the kkth population is the best over the odds that the kβˆ’1k-1th population is the best. Whenever that ratio is β‰₯ΞΈβˆ—\geq \theta^*, we want to ensure that the probability of correct selection is at least Pβˆ—P^*.

The procedure proceeds in stages. We take one Bernoulli observation from each population, and we reevaluate to see if the odds ratio satisfies our requirement. This method is even more efficient than curtailment, which is more efficient than the basic single-stage procedure.

Let's take a look at the procedure itself. At the mmth stage of experimentation, mβ‰₯1m \geq 1, we observe the random Bernoulli vector, (X1m,...,Xkm)(X_{1m},...,X_{km}). For example, if we are looking at k=5k = 5 populations, the vector might look like (0,1,1,1,0)(0,1,1,1,0).

Next, we compute the population sums so far:

Yim=βˆ‘j=1mXij,(1≀i≀k)Y_{im} = \sum_{j=1}^m X_{ij}, \quad (1 \leq i \leq k)

Then, we order the sums:

Y[1]m≀⋯≀Y[k]mY_{[1]m} \leq \cdots \leq Y_{[k]m}

We stop at stage mm if the following inequality holds:

Zmβ‰‘βˆ‘i=1kβˆ’1(1/ΞΈβˆ—)Y[k]mβˆ’Y[i]m≀1βˆ’Pβˆ—Pβˆ—Z_m \equiv \sum_{i=1}^{k-1} (1 / \theta^*)^{Y_{[k]m} - Y_{[i]m}} \leq \frac{1-P^*}{P^*}

If the difference Y[k]mβˆ’Y[i]mY_{[k]m} - Y_{[i]m} is large, for all ii, then 1/ΞΈβˆ—1 / \theta^* raised to that power will be small, and the summation is likely to be less than (1βˆ’Pβˆ—)/Pβˆ—(1- P^*) / P^*. In other words, we stop when the best population is sufficiently ahead of the remaining populations.

Let's let NN be the random stage mm when the procedure stops. Then, we select the population yielding Y[k]NY_{[k]N} as the one associated with p[k]p_{[k]}.

For example, let k=3k = 3 and (Pβˆ—,ΞΈβˆ—)=(0.75,2)(P^*, \theta^*) = (0.75, 2). Suppose we obtain the following sequence of vector-observations using BBKS\mathcal B_{\text{BKS}}:

mX1mX2mX3mY1mY2mY3mZm11011011.520111121.030111230.7540011240.37551112350.37561013360.25\begin{array}{c|ccc|ccc|c} m & X_{1m} & X_{2m} & X_{3m} & Y_{1m} & Y_{2m} & Y_{3m} & Z_m \\ \hline 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1.5 \\ 2 & 0 & 1 & 1 & 1 & 1 & 2 & 1.0 \\ 3 & 0 & 1 & 1 & 1 & 2 & 3 & 0.75 \\ 4 & 0 & 0 & 1 & 1 & 2 & 4 & 0.375 \\ 5 & 1 & 1 & 1 & 2 & 3 & 5 & 0.375 \\ 6 & 1 & 0 & 1 & 3 & 3 & 6 & 0.25 \\ \end{array}

Given Pβˆ—=0.75P^* = 0.75, we stop when Zm≀(1βˆ’0.75)/0.75=1/3Z_m \leq (1-0.75)/0.75 = 1/3. After the N=6N = 6 round of sampling, we get Z6=0.25≀1/3Z_6 = 0.25 \leq 1/3, and we choose the population with the largest YimY_{im} as the one associated with p[k]p_{[k]}. In this case, we choose population three.

Multinomial Cell Selection

In this lesson, we will talk about the multinomial selection problem, which corresponds to the problem of finding the most-probable category. This selection problem has applications in surveys, simulations, and more.

Multinomial Selection - Intro

We want to select the multinomial cell (category) having the largest probability. For example:

  • Who is the most popular political candidate?
  • Which television show is most-watched during a particular time slot?
  • Which simulated warehouse configuration is most likely to maximize throughput?

These questions are different than the Bernoulli examples we looked at previously. In those examples, the populations succeeded or failed independently of one another. In this case, the various cells are competing against each other.

Yet again, we will take the indifference-zone approach, albeit in a slightly different way than in the binomial procedure.

Here's the experimental setup. We have kk possible outcomes. The probability that we will select a given category, ii, as being the most probable is pip_i. We will take nn independent replications of the experiment and let YiY_i equal the number of outcomes falling in category ii after taking the nn observations.

Let's look at a definition. Suppose we have a kk-variate discrete vector random variable Y=(Y1,Y2,...,Yk)\bold Y = (Y_1, Y_2,...,Y_k) with the following probability mass function:

P{Y1=y1,Y2=y2,...,Yk=yk}=n!∏i=1kyi!∏i=1kpiyiP\{Y_1 = y_1, Y_2 = y_2,...,Y_k = y_k\} = \frac{n!}{\prod_{i=1}^ky_i!}\prod_{i=1}^k p^{y_i}_i

If YY has the probability mass function, then Y\bold Y has the multinomial distribution with parameters nn and p=(p1,...,pk)\bold p = (p_1,...,p_k), where βˆ‘i=1kpi=1\sum_{i=1}^k p_i = 1 and pi>0p_i > 0 for all ii. The multinomial generalizes the binomial.


Suppose three of the faces of a fair die are red, two are blue, and one is violet. Therefore p=(3/6,2/6,1/6)\bold p = (3/6, 2/6, 1/6). Let's toss the die n=5n=5 times. Then the probability of observing exactly three reds, no blues, and two violets is:

P{Y=(3,0,2)}=5!3!0!2!(3/6)3(2/6)0(1/6)2=0.03472P\{\bold Y = (3,0,2)\} = \frac{5!}{3!0!2!}(3/6)^3(2/6)^0(1/6)^2 = 0.03472

Now suppose that we did not know the probabilities for red, blue, and violet, and we want to select the most probable color. Using the selection rule, we choose the color the occurs the most frequently during the five trials, using randomization to break ties.

Let Y=(Yr,Yb,Yv)\bold Y = (Y_r, Y_b, Y_v) denote the number of occurrences of (red, blue, violet) in five trials. The probability we correctly select red is:

P{redΒ winsΒ inΒ 5Β trials}=P{Yr>Yb,Yr>Yv}+0.5P{Yr=Yb,Yr>Yv}+0.5P{Yr=Yb,Yr>Yv}=P{Y=(5,0,0),(4,1,0),(4,0,1),(3,2,0),(3,1,1),(3,0,2)}+0.5P{Y=(2,2,1)}+0.5P{Y=(2,1,2)}\begin{alignedat}{1} & P\{\text{red wins in 5 trials}\} \\ & = P\{Y_r > Y_b, Y_r > Y_v\} + 0.5P\{Y_r = Y_b, Y_r > Y_v\} \\ & \quad + 0.5P\{Y_r = Y_b, Y_r > Y_v\} \\ & = P\{\bold Y = (5,0,0), (4,1,0), (4,0,1), (3,2,0), (3,1,1), (3,0,2)\} \\ & \quad + 0.5P\{\bold Y = (2,2,1)\} + 0.5P\{\bold Y = (2,1,2)\} \\ \end{alignedat}

Notice that we have a coefficient of 0.5 in front of the cases where red ties because we randomly break two-way ties, selecting each color with a probability of 0.5.

We can list the outcomes favorable to a correct selection of red, along with the associated probabilities (calculated using the multinomial pmf above), randomizing in case of ties:

OutcomeContribution(red,Β blue,Β violet)toΒ P{redΒ winsΒ inΒ 5Β trials}(5,0,0)0.03125(4,1,0)0.10417(4,0,1)0.05208(3,2,0)0.13889(3,1,1)0.13889(3,0,2)0.03472(2,2,1)(0.5)(0.13889)(2,1,2)(0.5)(0.06944)0.60416\begin{array}{c|c} \text{Outcome} & \text{Contribution} \\ \text{(red, blue, violet)} & \text{to } P\{\text{red wins in 5 trials}\} \\ \hline (5,0,0) & 0.03125 \\ (4,1,0) & 0.10417 \\ (4,0,1) & 0.05208 \\ (3,2,0) & 0.13889 \\ (3,1,1) & 0.13889 \\ (3,0,2) & 0.03472 \\ (2,2,1) & (0.5)(0.13889) \\ (2,1,2) & (0.5)(0.06944) \\ \hline & 0.60416 \end{array}

As we can see, in this case, the probability of correct selection is 0.6042, given n=5n=5 observations. We can increase the probability of correct selection by increasing nn.

Example 2

The most probable alternative might be preferable to that having the largest expected value. Consider two inventory policies, AA and BB. The profit from AA is $5 with probability one, and the profit from BB is $0 with probability 0.99 and $1000 with probability 0.01.

The expected profit from AA is $5 and the expected profit from BB is $10: 1000βˆ—0.01+0=101000 * 0.01 + 0 = 10. However, the actual profit from AA is better than the profit from BB with probability 0.99. Even E[A]<E[B]E[A] < E[B], AA wins almost all of the time.

Multinomial Procedure & Extensions

In this lesson, we'll look at the single-stage procedure for selecting the most-probable cell along with a few extensions.

Assumptions and Notation for Selection Problem

Let's let Xj=(X1j,...,Xkj),jβ‰₯1\bold X_j = (X_{1j},...,X_{kj}), \enspace j \geq 1 be independent observations taken from a multinomial distribution having kβ‰₯2k \geq 2 categories with associated unknown probabilities p=(p1,...,pk)\bold p = (p_1,...,p_k).

For a fixed jj, XjX_j refers to one multinomial observation, which is of length kk, corresponding to each of the kk categories. For example, if we have k=8k=8 political candidates to choose from, XjX_j has k=8k=8 entries.

The entry XijX_{ij} equals one if category ii occurs on the jjth observation. Otherwise, Xij=0X_{ij} = 0. For example, if we ask someone to choose from eight different candidates, and they choose candidate one, then X1j=1X_{1j} = 1 and X2j=β‹―=Β X8j=0X_{2j} = \cdots = \ X_{8j} = 0.

We can order the underlying, unknown pip_i's like so: p[1]≀⋯≀p[k]p_{[1]} \leq \cdots \leq p_{[k]}. The category with p[k]p_{[k]} is the most probable or best category.

We will survey many people and count up the number of times they select each category. The cumulative sum for category ii after we take mm multinomial observations is:

Yim=βˆ‘j=1mXijY_{im} = \sum_{j = 1}^m X_{ij}

The ordered YimY_{im}'s are Y[1]m≀⋯≀Y[k]mY_{[1]m} \leq \cdots \leq Y_{[k]m}. We select the candidate associated with Y[k]mY_{[k]m} as being associated with p[k]p_{[k]}.

Our goal in this problem is to select the category associated with p[k]p_{[k]}, and we say that a correct selection is made if the goal is achieved. For specified (Pβˆ—,ΞΈβˆ—)(P^*, \theta^*), where 1/k<Pβˆ—<11/k < P^* < 1 and ΞΈβˆ—>1\theta^* > 1, we have the following probability requirement:

P(CS)β‰₯Pβˆ—wheneverp[k]/p[kβˆ’1]β‰₯ΞΈβˆ—P(\text{CS}) \geq P^* \quad\text{whenever}\quad p_{[k]} / p_{[k-1]} \geq \theta^*

When the ratio of the best category to the second-best category eclipses a certain ΞΈβˆ—\theta^* that we specify, we want to ensure that we make the right selection with at least probability Pβˆ—P^*. We can interpret ΞΈβˆ—\theta^* as the smallest p[k]/p[kβˆ’1]p_{[k]} / p_{[k-1]} ratio worth detecting.

Single-Stage Procedure

Let's look at the single-stage procedure, MBEM\mathcal M_{\text{BEM}}, by Bechhofer, Elmaghraby, and Morse, from 1959. For a given kk, Pβˆ—P^*, and ΞΈβˆ—\theta^*, we find the required number of multinomial observations, nn, from a table.

First, we take nn multinomial observations (X1j,...,Xkj)1≀j≀n(X_{1j},...,X_{kj}) \enspace 1 \leq j \leq n in a single stage. From there, we calculate the ordered sample sums, Y[1]n≀⋯≀Y[k]nY_{[1]n} \leq \cdots \leq Y_{[k]n}. Finally, we take the category with the largest sum, Y[k]nY_{[k]n} as the one associated with p[k]p_{[k]}, randomizing to break ties.

The nn-values are computed so that MBEM\mathcal M_{\text{BEM}} achieves P(CS)β‰₯Pβˆ—P(\text{CS}) \geq P^* when the cell probabilities p\bold p are in the least-favorable configuration:

p[1]=p[kβˆ’1]=1/(ΞΈβˆ—+kβˆ’1),p[k]=ΞΈβˆ—/(ΞΈβˆ—+kβˆ’1)p_{[1]} = p_{[k-1]} = 1/(\theta^* + k - 1), p_{[k]} = \theta^* / (\theta^* + k - 1)

In this configuration, the best category is ahead of all of the other categories by a factor of exactly ΞΈ\theta, and the other pip_i's are all identical.

In any event, here is the table, parameterized by kk, Pβˆ—P^*, and ΞΈβˆ—\theta^* that guarantees the appropriate probability of correct selection.


A soft drink producer wants to find the most popular of k=3k=3 proposed cola formulations. The company will give a taste test to nn people. The sample size nn is to be chosen so that P(CS)β‰₯0.95P(\text{CS}) \geq 0.95 whenever the ratio of the largest to second largest true (but unknown) proportions is at least 1.4.

From the table, we see that, for k=3k=3, Pβˆ—=0.95P^* = 0.95, and ΞΈβˆ—=1.4\theta^* = 1.4, we need to take n=186n = 186 multinomial observations. Let's take those observations and assume that we find Y1,186=53Y_{1,186} = 53, Y2,186=110Y_{2,186} = 110, and Y3,186=26Y_{3,186 = 26}. Then, we select formulation two as the best.


We can use curtailment in multinomial cell selection and stop sampling when the category in second place cannot win. Additionally, we can use a sequential procedure and take observations one at a time until the winner is clear.

OMSCS Notes is made with in NYC by Matt Schlenker.

Copyright Β© 2019-2023. All rights reserved.

privacy policy