This introduction was copied from Damian Yerrick's E2 writeup. It will need to be rewritten from an encyclopedic neutral point of view.
OK, I have a song stored as 2-channel, 16-bit linear PCM on my reasonably fast computer. I want to slow down the tempo because I'm trying to remix it with another song. "Re-perform it!" No, I don't have the source score or samples, and I don't have the vocal training; all I have is this wav file I extracted from a CD.
"Resample it!" No, resampling digital audio has an effect analogous to that of slowing down a phonograph turntable: it transposes the song to a lower key and makes the singer sound like an ogre.
One way of stretching a signal is to build a phase vocoder after Flanagan, Golden, and Portnoff.
Basic steps: compute the frequency/time relationship of the signal by taking the Fast Fourier Transform of each windowed block of 2,048 samples (assuming 44.1 kHz input), do some processing of the frequencies' amplitudes and phases, and perform the inverse FFT.
A good algorithm will give good results at compression/expansion ratios of up to about ±25%; beyond that, the pre-echo and other smearing artifacts of frequency-domain interpolation on transient ("beat") waveforms, which are not localized at all in the frequency domain, begin to take a toll on perceived audio quality.
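The basic steps above can be sketched in code. The following is a minimal mono phase vocoder in Python/NumPy; the Hann window, the 512-sample hop, and the overlap-add synthesis with accumulated phase are conventional choices assumed here for illustration, not details specified above:

```python
import numpy as np

def phase_vocoder(x, stretch, n_fft=2048, hop=512):
    """Time-stretch mono signal x by `stretch` (>1 = slower/longer)."""
    window = np.hanning(n_fft)
    # Analysis: FFT of each windowed block, hop samples apart
    frames = [np.fft.rfft(window * x[s:s + n_fft])
              for s in range(0, len(x) - n_fft, hop)]
    frames = np.array(frames)

    # Expected phase advance per hop for each frequency bin
    omega = 2 * np.pi * np.arange(n_fft // 2 + 1) * hop / n_fft

    out_len = int(len(x) * stretch)
    y = np.zeros(out_len + n_fft)
    phase = np.angle(frames[0])
    t, out_pos = 0.0, 0
    # Synthesis: step through analysis frames at rate 1/stretch,
    # accumulating corrected phases so each partial stays coherent
    while int(t) + 1 < len(frames) and out_pos + n_fft < len(y):
        i = int(t)
        frac = t - i
        # Interpolate magnitudes between neighbouring analysis frames
        mag = (1 - frac) * np.abs(frames[i]) + frac * np.abs(frames[i + 1])
        # Measured phase increment, wrapped around the expected advance
        dphi = np.angle(frames[i + 1]) - np.angle(frames[i]) - omega
        dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
        phase += omega + dphi
        y[out_pos:out_pos + n_fft] += window * np.fft.irfft(mag * np.exp(1j * phase))
        t += 1.0 / stretch
        out_pos += hop
    return y[:out_len]
```

Stretching a 440 Hz sine by 1.5 with this sketch yields output 1.5× as long whose spectral peak stays at 440 Hz, which is exactly the property resampling cannot give you.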
Rabiner and Schafer in 1978 put forth an alternate solution: work in the time domain, attempt to find the period of a given section of the fundamental wave with the autocorrelation function, and crossfade one period into another.
This is called time-domain harmonic scaling (TDHS) or the synchronized overlap-add (SOLA) method; it runs somewhat faster than the phase vocoder on slower machines but fails when the autocorrelation misestimates the period of a signal with complicated harmonics (such as orchestral pieces).
Cool Edit Pro seems to solve this by looking for the period closest to a center period that the user specifies, which should be an integer multiple of the tempo, and between 30 Hz and the lowest bass frequency. For a 120 bpm tune, use 48 Hz because 48 Hz = 2,880 cycles/minute = 24 cycles/beat * 120 bpm.
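A minimal sketch of the time-domain approach: estimate the local period by autocorrelation, then insert crossfaded copies of whole periods until the output reaches the target length. The function names, the lag-search range, and the linear crossfade here are illustrative assumptions, not Cool Edit Pro's actual algorithm:

```python
import numpy as np

def find_period(block, min_lag, max_lag):
    """Estimate the local period (in samples) via autocorrelation."""
    block = block - block.mean()
    corr = np.correlate(block, block, mode='full')[len(block) - 1:]
    return min_lag + int(np.argmax(corr[min_lag:max_lag]))

def stretch_time_domain(x, stretch, sr, f_lo=50.0, f_hi=400.0):
    """Slow down mono x (stretch > 1) by crossfading in repeated periods.
    f_lo/f_hi bound the period search, like a user-specified center period."""
    min_lag, max_lag = int(sr / f_hi), int(sr / f_lo)
    out, pos, produced = [], 0, 0
    while pos + 2 * max_lag < len(x):
        p = find_period(x[pos:pos + 2 * max_lag], min_lag, max_lag)
        out.append(x[pos:pos + p])
        produced += p
        # Insert crossfaded duplicates of this period until output catches up
        while produced < (pos + p) * stretch and pos + 2 * p <= len(x):
            fade = np.linspace(0.0, 1.0, p, endpoint=False)
            out.append((1 - fade) * x[pos:pos + p] + fade * x[pos + p:pos + 2 * p])
            produced += p
        pos += p
    return np.concatenate(out)
```

Because whole periods are repeated, a periodic input keeps its pitch while its duration grows; the failure mode described above appears when `find_period` locks onto the wrong lag.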
High-end commercial audio processing packages combine the two techniques, using wavelet techniques to separate the signal into sinusoid and transient waveforms, applying the phase vocoder to the sinusoids, and processing transients in the time domain, producing the highest quality time stretching.
These techniques can also be used to scale the pitch of an audio sample while holding time constant.
(Note that the technique is properly called pitch scaling, not "shifting," as pitch shifting by amplitude modulation with a complex exponential does not preserve the ratios of the harmonic frequencies that determine the sound's timbre.)
Time domain processing works much better here, as smearing is less noticeable, but scaling vocal samples distorts the formants into a sort of Alvin and the Chipmunks-like effect, which may be desirable or undesirable.
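One standard way to realize pitch scaling is to time-stretch by the desired ratio and then resample the result back to the original duration, so the transposition remains but the length change is undone. A sketch with a plain linear-interpolation resampler; the helper names and the toy period-repeating stretcher in the test are assumptions for illustration, not a production design:

```python
import numpy as np

def resample_linear(x, factor):
    """Linear-interpolation resampling: output has len(x)*factor samples,
    so factor < 1 plays the content faster and raises pitch (turntable-style)."""
    n_out = int(len(x) * factor)
    pos = np.arange(n_out) / factor          # fractional read positions in x
    i = np.minimum(pos.astype(int), len(x) - 2)
    frac = pos - i
    return (1 - frac) * x[i] + frac * x[i + 1]

def pitch_scale(x, ratio, time_stretch):
    """Scale pitch by `ratio` at constant duration: stretch time by `ratio`
    with any time stretcher, then resample by 1/ratio to restore the length."""
    return resample_linear(time_stretch(x, ratio), 1.0 / ratio)
```

Any of the stretchers discussed above can be plugged in as `time_stretch`; the formant distortion described next comes from this resampling step moving the spectral envelope along with the harmonics.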
To preserve the formants and character of the voice, you can use a "regular" channel vocoder keyed to the signal's fundamental frequency.
(Following a single voice's fundamental is straightforward; leave a note on Damian's talk page if you want more information.)
Phase vocoder
Time domain
Pitch scaling
Additional resources
See also: