This is mostly just a reference post for me to assemble thoughts on what I’ve read about video encoding over the past couple of days. It might not make much sense.
First, a good starting point is the book The H.264 Advanced Video Compression Standard by Iain E. Richardson. It’s where I’ve picked up most of the relevant information. The concepts and technologies are pretty involved, with a lot of terminology, so the book was helpful in that regard. If you read Chapter 3, Video Coding Concepts, it’s a lot easier to understand H.264 and VP8.
So how does a video stream go from raw to compressed? It’s actually a fairly interesting process. In most cases, it goes something like this (although implementations/specs may vary):
1. An input video frame, F, is processed in macroblocks, each usually a 16 x 16 pixel area of the frame.
2. A motion estimation function finds a 16 x 16 pixel region in the reference frame (usually the previously encoded frame) that most closely matches the current macroblock. The offset between the matching region’s position in the reference frame and the current macroblock’s position is calculated. This offset is called the motion vector; it drives the prediction in the next step and lets compression stay effective in cases where the camera is panning.
3. Using the motion vector, a prediction is generated, which basically states what the current macroblock might look like based on the reference frame and the motion vector.
4. The prediction is subtracted from the current macroblock to produce a residual macroblock. This residual macroblock represents what has changed since the reference frame, after taking motion into account. We’ve therefore reduced transmission of an entire frame’s data to only the motion vector plus the residual data required to represent change.
5. Now, the residual macroblock undergoes a transformation to reduce the information content further. This transformation is typically a Discrete Cosine Transform (DCT), which basically takes a block of pixel values and translates them into a matrix of frequency coefficients. Most of the block’s energy ends up in a few of those coefficients, so the algorithm can chop off the rest and still reconstruct the block fairly closely. Remember, we are working with small pixel regions, so being “close enough” is a good trade-off.
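6. The coefficients are quantized. Quantization means dividing each coefficient by a step size and rounding the result, so a wide range of values maps onto a small set of levels and many of the small coefficients become zero. This is the step where information is actually thrown away. In some cases, the step sizes come from a table that is shared by the encoder and decoder.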
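7. These quantized coefficients are reordered, typically with a zigzag scan, so that the non-zero coefficients tend to be grouped together at the start. Since it’s likely that there are a large number of zero-valued coefficients, a special encoding called run-level encoding is used to compress these blocks: only the non-zero coefficients are represented, each along with the number of zeros that precede it.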
8. The coefficients, motion vector and all other header information required for the decoder to reconstruct the frames are entropy encoded to create a bitstream. Examples of entropy encoding are Huffman codes and arithmetic coding.
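To make steps 2 to 4 a bit more concrete for myself, here is a minimal sketch of block matching and residual calculation. It is not how any real encoder does it: I’m assuming an exhaustive search, sum of absolute differences (SAD) as the matching cost, 4 x 4 blocks instead of 16 x 16 to keep it readable, and a motion vector defined as the displacement from the current block to its match in the reference frame. All function names and the synthetic test frame are made up for illustration.

```python
def make_reference(size):
    """Synthetic reference frame; x*x + y*y stays within 0..255 for size 12."""
    return [[x * x + y * y for x in range(size)] for y in range(size)]


def shift_frame(ref, sx, sy):
    """Simulate a pan: content moves right by sx and down by sy (both >= 0).
    Pixels with no source in the reference frame are set to 0."""
    size = len(ref)
    return [[ref[y - sy][x - sx] if y >= sy and x >= sx else 0
             for x in range(size)] for y in range(size)]


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def motion_search(cur, ref, bx, by, n, search):
    """Exhaustive search over +/- search pixels for the lowest-SAD displacement."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


def residual_block(cur, ref, bx, by, mv, n):
    """Current block minus the motion-compensated prediction from ref."""
    dx, dy = mv
    return [[cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]
             for x in range(n)] for y in range(n)]


N = 4
ref = make_reference(12)
cur = shift_frame(ref, 2, 1)          # "camera pan" of 2 right, 1 down
mv = motion_search(cur, ref, 4, 4, N, search=3)
residual = residual_block(cur, ref, 4, 4, mv, N)
no_mc = residual_block(cur, ref, 4, 4, (0, 0), N)  # residual without motion compensation
print(mv)
print(residual)
```

With this synthetic pan the search finds the vector (-2, -1) and the residual comes out as all zeros, while the residual computed without motion compensation (`no_mc`) is non-zero everywhere. That is the whole point of step 2: a good motion vector leaves much less to encode.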
A reverse process is used by the decoder to reproduce the frame/macroblock.
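Here is a rough sketch of steps 5 to 7 together with the decoder’s reverse path, mostly to convince myself that the round trip works. The assumptions are mine: a textbook floating-point 4 x 4 DCT (H.264 itself uses an integer approximation of the transform), a single uniform quantizer step for every coefficient, a zigzag scan, and run-level pairs of (zeros before, value) with trailing zeros implied. The function names and the sample residual block are hypothetical.

```python
import math

N = 4  # block size used in this sketch


def dct_matrix(n):
    """Orthonormal DCT-II basis C, so that coefficients = C * X * C^T."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def zigzag_order(n):
    """(row, col) positions visited along anti-diagonals, alternating direction."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 else reversed(diag))
    return order


C = dct_matrix(N)
CT = [list(col) for col in zip(*C)]
ZIGZAG = zigzag_order(N)


def run_level_encode(scan):
    """Encode as (number of zeros before, non-zero value); trailing zeros are implied."""
    pairs, run = [], 0
    for value in scan:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs


def run_level_decode(pairs, length):
    scan = []
    for run, level in pairs:
        scan.extend([0] * run)
        scan.append(level)
    return scan + [0] * (length - len(scan))


def encode_block(residual, step):
    coeffs = matmul(matmul(C, residual), CT)                     # step 5: transform
    levels = [[int(round(v / step)) for v in row] for row in coeffs]  # step 6: quantize
    scan = [levels[r][c] for r, c in ZIGZAG]                     # step 7: reorder
    return run_level_encode(scan)


def decode_block(pairs, step):
    scan = run_level_decode(pairs, N * N)
    levels = [[0] * N for _ in range(N)]
    for (r, c), value in zip(ZIGZAG, scan):
        levels[r][c] = value
    coeffs = [[v * step for v in row] for row in levels]         # rescale
    return matmul(matmul(CT, coeffs), C)                         # inverse transform


residual = [[5, 4, 4, 3],
            [4, 4, 3, 3],
            [4, 3, 3, 2],
            [3, 3, 2, 2]]
step = 2.0
pairs = encode_block(residual, step)
rebuilt = decode_block(pairs, step)
max_error = max(abs(residual[r][c] - rebuilt[r][c])
                for r in range(N) for c in range(N))
print(pairs)
print(max_error)
```

The 16 input values shrink to a handful of run-level pairs, and the rebuilt block is close to the original but not identical, because the rounding in the quantizer cannot be undone. A larger `step` gives fewer pairs and a bigger error, which is the quality versus size trade-off in miniature.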