Audio timescale-pitch modification

Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Pitch scaling or pitch shifting is the opposite: the process of changing the pitch without affecting the speed. Similar methods can change speed, pitch, or both at once, in a time-varying way.

These processes are used, for instance, to match the pitches and tempos of two pre-recorded clips for mixing when the clips cannot be reperformed or resampled. (A drum track containing no pitched instruments could be moderately resampled for tempo without adverse effects, but a pitched track could not). They are also used to create effects such as increasing the range of an instrument (like pitch shifting a guitar down an octave).

The simplest way to change the duration or pitch of a digital audio clip is to resample it. This is a mathematical operation that effectively rebuilds a continuous waveform from its samples and then samples that waveform again at a different rate. When the new samples are played at the original sampling frequency, the audio clip sounds faster or slower. Unfortunately, the frequencies in the sample are always scaled at the same rate as the speed, transposing its perceived pitch up or down in the process. In other words, slowing down the recording lowers the pitch, speeding it up raises the pitch, and using this method the two effects cannot be separated. This is analogous to speeding up or slowing down an analogue recording, like a phonograph record or tape, creating the Chipmunk effect.

In order to preserve an audio signal's pitch when stretching or compressing its duration, many time-scale modification (TSM) procedures follow a frame-based approach. Given an original discrete-time audio signal, this strategy's first step is to split the signal into short analysis frames of fixed length. The analysis frames are spaced by a fixed number of samples, called the analysis hopsize $H_{a}\in \mathbb {N}$ . To achieve the actual time-scale modification, the analysis frames are then temporally relocated to have a synthesis hopsize $H_{s}\in \mathbb {N}$ . This frame relocation results in a modification of the signal's duration by a stretching factor of $\alpha =H_{s}/H_{a}$ . However, simply superimposing the unmodified analysis frames typically results in undesired artifacts such as phase discontinuities or amplitude fluctuations. To prevent these kinds of artifacts, the analysis frames are adapted to form synthesis frames, prior to the reconstruction of the time-scale modified output signal.

...
Wikipedia