Binomial distribution
The binomial distribution is a discrete probability distribution which describes the number of successes in a sequence of n independent experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment.
A typical example is the following: 5% of the population are HIV-positive. You pick 500 people randomly. How likely is it that you get 30 or more HIV-positives?
The number of HIV-positives you pick is a random variable X which follows a binomial distribution with n = 500 and p = .05. We are interested in the probability Pr[X ≥ 30].
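This tail probability can be summed directly from the binomial probabilities (a short Python sketch; the probability formula used here is the one given in the next paragraph):

```python
from math import comb

# Pr[X = k] for X ~ B(n, p): C(n, k) * p^k * (1-p)^(n-k)
def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 500, 0.05
# Pr[X >= 30] via the complement 1 - Pr[X <= 29] (fewer terms to sum)
prob = 1 - sum(binom_pmf(n, k, p) for k in range(30))
print(round(prob, 4))
```

Since the mean here is np = 25, getting 30 or more positives is only moderately unlikely.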
In general, if the random variable X follows the binomial distribution with parameters n and p, we write X ~ B(n, p). The probability of getting exactly k successes is given by
- Pr[X = k] = C(n, k) p^k (1-p)^(n-k) for k = 0, 1, 2, ..., n
Here, C(n, k) denotes the binomial coefficient of n and k, whence the name of the distribution. The formula can be understood as follows: we want k successes (p^k) and n-k failures ((1-p)^(n-k)). However, the k successes can occur anywhere among the n trials, and there are C(n, k) different ways of distributing k successes in a sequence of n trials.
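This counting argument can be checked by brute force for small n: among all 2^n success/failure sequences, exactly C(n, k) contain k successes, each with probability p^k (1-p)^(n-k). A small illustrative sketch:

```python
from itertools import product
from math import comb

n, k, p = 5, 2, 0.3
# Each sequence of n trials is a tuple of 0s (failure) and 1s (success)
sequences = list(product([0, 1], repeat=n))
with_k_successes = [s for s in sequences if sum(s) == k]
# There are C(n, k) such sequences ...
assert len(with_k_successes) == comb(n, k)
# ... each occurring with probability p^k (1-p)^(n-k), so
prob = len(with_k_successes) * p**k * (1 - p)**(n - k)
print(prob)  # C(5, 2) * 0.3^2 * 0.7^3
```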
If X ~ B(n, p), then the expected value of X is
- E[X] = np
and the variance is
- Var(X) = np(1-p).
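Both identities can be verified numerically by computing the mean and variance directly from the probability mass function (a sketch; the parameters n = 12, p = 0.4 are arbitrary example values):

```python
from math import comb

n, p = 12, 0.4
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pmf[k] for k in range(n + 1))
var = sum((k - mean) ** 2 * pmf[k] for k in range(n + 1))

# Should agree (up to float rounding) with np = 4.8 and np(1-p) = 2.88
print(mean, var)
```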
The most likely value or mode of X is given by the largest integer less than or equal to (n+1)p; if m = (n+1)p is itself an integer, then m-1 and m are both modes.
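A quick numerical check of the mode rule, covering both cases (the parameter choices below are illustrative):

```python
from math import comb, floor

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Non-integer case: (n+1)p = 3.3, so the unique mode is floor(3.3) = 3
n, p = 10, 0.3
pmf = [binom_pmf(n, k, p) for k in range(n + 1)]
assert pmf.index(max(pmf)) == floor((n + 1) * p)

# Integer case: (n+1)p = 5, so both 4 and 5 are modes
n, p = 9, 0.5
pmf = [binom_pmf(n, k, p) for k in range(n + 1)]
assert pmf[4] == pmf[5] == max(pmf)
```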
If X ~ B(n, p) and Y ~ B(m, p) are independent binomial variables, then X + Y is again a binomial variable; its distribution is B(n+m, p).
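This additivity can be confirmed by convolving the two probability mass functions (a sketch with arbitrary parameters; the inner sum runs over the ways of splitting s successes between X and Y):

```python
from math import comb

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, m, p = 4, 6, 0.35
# Pr[X + Y = s] = sum over i of Pr[X = i] * Pr[Y = s - i]
conv = [sum(binom_pmf(n, i, p) * binom_pmf(m, s - i, p)
            for i in range(max(0, s - m), min(n, s) + 1))
        for s in range(n + m + 1)]

# Matches B(n+m, p) term by term
for s in range(n + m + 1):
    assert abs(conv[s] - binom_pmf(n + m, s, p)) < 1e-12
```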
Two other important distributions arise as approximations of binomial distributions:
- If both np and n(1-p) are greater than 5 or so, then an excellent approximation to B(n, p) is given by the normal distribution N(np, np(1-p)). This approximation is a huge time saver; historically, it was the first use of the normal distribution. Nowadays, it can be seen as a consequence of the central limit theorem since B(n, p) is a sum of n independent, identically distributed 0-1 indicator variables.
- For example, suppose you randomly sample n people out of a large population and ask them whether they agree with a certain statement. The proportion of people who agree will of course depend on the sample. If you sampled groups of n people repeatedly and truly randomly, the proportions would follow an approximate normal distribution with mean equal to the true proportion p of agreement in the population and with standard deviation σ = (p(1-p)/n)^(1/2). Large sample sizes n are good because the standard deviation gets smaller, which allows a more precise estimate of the unknown parameter p.
- If n is large and p is small, so that np is of moderate size, then the Poisson distribution with parameter λ = np is a good approximation to B(n, p).
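Both approximations can be illustrated by comparing pointwise probabilities (a sketch; the parameter values are illustrative choices, not taken from the text above):

```python
from math import comb, exp, factorial, pi, sqrt

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Normal approximation: np = 40 and n(1-p) = 60, both well above 5
n, p, k = 100, 0.4, 40
mu, sigma = n * p, sqrt(n * p * (1 - p))
normal_density = exp(-((k - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))
print(binom_pmf(n, p=p, k=k), normal_density)  # nearly equal

# Poisson approximation: n large, p small, np = 5 of moderate size
n, p, k = 1000, 0.005, 5
lam = n * p
poisson_pmf = exp(-lam) * lam**k / factorial(k)
print(binom_pmf(n, k, p), poisson_pmf)  # nearly equal
```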
The formula for Bézier curves was inspired by the binomial distribution.