The theorem states that, when converting from an analog signal to digital (or otherwise sampling a signal at discrete intervals), the sampling frequency must be greater than twice the highest frequency of the input signal in order to be able to reconstruct the original perfectly from the sampled version.
If the sampling frequency is less than this limit, then frequencies in the original signal that are above half the sampling rate will be "aliased" and will appear in the resulting signal as lower frequencies. Therefore, an analog low-pass filter is typically applied before sampling to ensure that no components with frequencies greater than half the sample frequency remain. This is called an "anti-aliasing filter".
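The aliasing effect is easy to demonstrate numerically. A minimal sketch (using NumPy; the particular rates are chosen for illustration) in which a 7 Hz tone, sampled at 8 Hz, produces exactly the same samples as a 1 Hz tone:

```python
import numpy as np

fs = 8.0                     # sampling rate in Hz; Nyquist limit is fs/2 = 4 Hz
t = np.arange(16) / fs       # sample times

low = np.cos(2 * np.pi * 1.0 * t)   # 1 Hz tone, below the limit
high = np.cos(2 * np.pi * 7.0 * t)  # 7 Hz tone, above the limit

# cos(2*pi*7*n/8) = cos(2*pi*n - 2*pi*n/8) = cos(2*pi*1*n/8),
# so the two tones are indistinguishable from their samples alone.
print(np.allclose(low, high))  # → True
```

This is exactly why the anti-aliasing filter must remove the 7 Hz component before sampling: afterwards there is no way to tell the two tones apart.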
The theorem also applies when reducing the sampling frequency of an existing digital signal.
The theorem was first formulated by Harry Nyquist in 1928 ("Certain topics in telegraph transmission theory"), but was only formally proved by Claude E. Shannon in 1949 ("Communication in the presence of noise"). Mathematically, the theorem is formulated as a statement about the Fourier transform.
If a function s(x) has a Fourier transform F[s(x)] = S(f) = 0 for |f| > W, then it is completely determined by giving the value of the function at a series of points spaced 1/(2W) apart. The values s_n = s(n/(2W)) are called the samples of s(x).
The minimum sampling rate that allows reconstruction of the original signal, that is 2W samples per unit distance, is known as the Nyquist rate. (The related term Nyquist frequency conventionally denotes half of a given sampling rate, i.e. the highest frequency that rate can represent.) The spacing 1/(2W) between samples is called the Nyquist interval.
If S(f) = 0 for |f| > W, then s(x) can be recovered from its samples by the Nyquist-Shannon Interpolation Formula.
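The interpolation formula is s(x) = sum over n of s_n · sinc(2Wx − n), with sinc(u) = sin(πu)/(πu). A minimal numerical sketch of it (necessarily truncating the infinite sum to a finite record; the 3 Hz test tone, record length, and error tolerance are illustrative choices):

```python
import numpy as np

def sinc_interpolate(samples, W, x):
    """Evaluate the (truncated) Nyquist-Shannon interpolation formula
        s(x) = sum_n s(n / (2W)) * sinc(2*W*x - n),
    where np.sinc(u) = sin(pi*u) / (pi*u)."""
    n = np.arange(len(samples))
    return np.sum(samples * np.sinc(2 * W * x - n))

# A 3 Hz cosine is band-limited with W = 4 Hz, so samples every 1/8 s suffice.
W = 4.0
s = lambda x: np.cos(2 * np.pi * 3.0 * x)
samples = s(np.arange(512) / (2 * W))   # 512 samples covering 0..64 s

x = 32.0625                             # a point between samples, mid-record
err = abs(sinc_interpolate(samples, W, x) - s(x))
print(err < 0.02)                       # only truncation error remains; → True
```

With the full (infinite) sum the reconstruction is exact; the small residual here comes only from cutting the record off at 512 samples.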
It should be noted that although "twice the highest frequency" is the more commonly quoted condition, it is not the most general one. The theorem in fact requires sampling at twice the bandwidth of the signal, which is not the same thing. The bandwidth is the width of the range between the lowest and the highest frequency present in the signal. Bandwidth and highest frequency coincide only for baseband signals, that is, those whose spectrum extends very nearly down to DC. This observation leads to the technique of undersampling, which is widely used in software-defined radio.
Imagine that you want to sample all the commercial FM radio stations broadcasting in a given area. They occupy channels from 88 MHz to 108 MHz, giving a combined signal with a bandwidth of 20 MHz. Under the baseband interpretation of the theorem, this would require a sampling frequency of more than 216 MHz. With undersampling, however, one only needs to sample at more than 40 MHz, as long as the antenna signal is first passed through a bandpass filter that keeps only the 88-108 MHz range. Sampling at 44 MHz, for example, the 100 MHz component is folded down to a 12 MHz digital frequency.
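The folding of 100 MHz down to 12 MHz can be computed directly; a small sketch (the helper name `alias` is our own, not a standard function):

```python
def alias(f, fs):
    """Frequency at which a tone f appears after sampling at rate fs.

    Folds f into the baseband [0, fs/2] (assumes real-valued sampling)."""
    f = f % fs                 # sampling cannot distinguish f from f mod fs
    return min(f, fs - f)      # real sampling also folds f and fs - f together

print(alias(100e6, 44e6))      # the 100 MHz carrier lands at 12 MHz
```

With the bandpass filter in place, every station folds to a distinct digital frequency, so the whole 88-108 MHz band can be recovered from the 44 MHz samples.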
In certain problems, the frequencies of interest do not form a single interval but some more general set F of frequencies. Again, the required sampling frequency is proportional to the size (total measure) of F. For instance, certain domain decomposition methods fail to converge for the 0th frequency (the constant mode) and for some medium frequencies; the set of interesting frequencies might then be 10 Hz to 100 Hz together with 110 Hz to 200 Hz. In this case, sampling at 360 Hz, rather than 400 Hz, suffices to capture these signals correctly, since the total measure of F is only 180 Hz.
It may even be the case that the space of interesting signals does not correspond neatly to a band of frequencies. Perhaps, instead of the FM band (88 MHz to 108 MHz), we are interested in reconstructing cubic polynomials. Cubic polynomials do not correspond to a frequency range, but they do form a four-dimensional subspace of the space of all possible signals, and by that token four equidistant samples will suffice to reconstruct an arbitrary cubic polynomial. (See also the fundamental theorem of algebra and polynomial interpolation.)
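A quick numerical check of this claim (the particular cubic is arbitrary; `numpy.polyfit` through four points solves the interpolation problem exactly):

```python
import numpy as np

# A cubic p(x) = 2x^3 - x^2 + 3x - 5 lives in a four-dimensional space,
# so fitting a degree-3 polynomial through 4 samples recovers it exactly.
p = np.poly1d([2.0, -1.0, 3.0, -5.0])
xs = np.array([0.0, 1.0, 2.0, 3.0])        # four equidistant samples
coeffs = np.polyfit(xs, p(xs), deg=3)      # solve the 4x4 Vandermonde system

print(np.allclose(coeffs, p.coefficients))  # → True: cubic fully recovered
```

Four samples are both necessary and sufficient here: three samples leave a one-parameter family of cubics, while a fifth adds no information.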
Hence, the Nyquist-Shannon sampling theorem is a kind of statement about the size of the vector spaces in question.