MPEG-FAQ 4.0: What is MPEG ?

What is MPEG ?
From comp.compression Mon Oct 19 15:38:38 1992
Sender: news@chorus.chorus.fr
Author: Mark Adler <madler@cco.caltech.edu>
[71] Introduction to MPEG (long)
What is MPEG?
Does it have anything to do with JPEG?
Then what's JBIG and MHEG?
What has MPEG accomplished?
So how does MPEG I work?
What about the audio compression?
So how much does it compress?
What's phase II?
When will all this be finished?
How do I join MPEG?
How do I get the documents, like the MPEG I draft?
[ There is no newer version of this part so far. Whoever wants to update ]
[ this description should do the job and send it in. ]
Written by Mark Adler <madler@cco.caltech.edu>.
Q. What is MPEG?
A. MPEG is a group of people who meet under ISO (the International
Organization for Standardization) to generate standards for digital video
(sequences of images in time) and audio compression. In particular,
they define a compressed bit stream, which implicitly defines a
decompressor. However, the compression algorithms are up to the
individual manufacturers, and that is where proprietary advantage
is obtained within the scope of a publicly available international
standard. MPEG meets roughly four times a year for roughly a week
each time. In between meetings, a great deal of work is done by
the members, so it doesn't all happen at the meetings. The work
is organized and planned at the meetings.
Q. So what does MPEG stand for?
A. Moving Picture Experts Group.
Q. Does it have anything to do with JPEG?
A. Well, it sounds the same, and they are part of the same subcommittee
of ISO along with JBIG and MHEG, and they usually meet at the same
place at the same time. However, they are different sets of people
with few or no common individual members, and they have different
charters and requirements. JPEG is for still image compression.
Q. Then what's JBIG and MHEG?
A. Sorry I mentioned them. Ok, I'll simply say that JBIG is for binary
image compression (like faxes), and MHEG is for multi-media data
standards (like integrating stills, video, audio, text, etc.).
For an introduction to JBIG, see question 74 below.
Q. Ok, I'll stick to MPEG. What has MPEG accomplished?
A. So far (as of January 1992), they have completed the "Committee
Draft" of MPEG phase I, colloquially called MPEG I. It defines
a bit stream for compressed video and audio optimized to fit into
a bandwidth (data rate) of 1.5 Mbits/s. This rate is special
because it is the data rate of (uncompressed) audio CD's and DAT's.
The draft is in three parts, video, audio, and systems, where the
last part gives the integration of the audio and video streams
with the proper timestamping to allow synchronization of the two.
They have also gotten well into MPEG phase II, whose task is to
define a bitstream for video and audio coded at around 3 to 10
Mbits/s.
Q. So how does MPEG I work?
A. First off, it starts with a relatively low resolution video
sequence (possibly decimated from the original) of about 352 by
240 pixels by 30 frames/s (US--different numbers for Europe),
but original high (CD) quality audio. The images are in color,
but converted to YUV space, and the two chrominance channels
(U and V) are decimated further to 176 by 120 pixels. It turns
out that you can get away with a lot less resolution in those
channels and not notice it, at least in "natural" (not computer
generated) images.
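To make the decimation concrete, here is a little Python sketch (mine,
not part of the FAQ or the standard) that takes one SIF frame's U and V
planes down to 176 by 120 by simple 2x2 averaging; real encoders use
proper filters, but the sizes are the point:

  import numpy as np

  # Toy sketch: decimate the chrominance planes of one SIF frame by 2:1 in
  # each direction with plain 2x2 averaging (real encoders filter properly).
  HEIGHT, WIDTH = 240, 352                      # SIF luminance size (US numbers)

  y = np.zeros((HEIGHT, WIDTH))                 # Y keeps full resolution
  u = np.zeros((HEIGHT, WIDTH))                 # U and V start at full resolution here
  v = np.zeros((HEIGHT, WIDTH))

  def decimate2(plane):
      """Average each 2x2 block, halving both dimensions."""
      return (plane[0::2, 0::2] + plane[1::2, 0::2] +
              plane[0::2, 1::2] + plane[1::2, 1::2]) / 4.0

  u_small, v_small = decimate2(u), decimate2(v)
  print(y.shape, u_small.shape, v_small.shape)  # (240, 352) (120, 176) (120, 176)
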
The basic scheme is to predict motion from frame to frame in the
temporal direction, and then to use DCT's (discrete cosine
transforms) to organize the redundancy in the spatial directions.
The DCT's are done on 8x8 blocks, and the motion prediction is
done in the luminance (Y) channel on 16x16 blocks. In other words,
given the 16x16 block in the current frame that you are trying to
code, you look for a close match to that block in a previous or
future frame (there are backward prediction modes where later
frames are sent first to allow interpolating between frames).
The DCT coefficients (of either the actual data, or the difference
between this block and the close match) are "quantized", which
means that you divide them by some value to drop bits off the
bottom end. Hopefully, many of the coefficients will then end up
being zero. The quantization can change for every "macroblock"
(a macroblock is 16x16 of Y and the corresponding 8x8's in both
U and V). The results of all of this, which include the DCT
coefficients, the motion vectors, and the quantization parameters
(and other stuff), are Huffman coded using fixed tables. The DCT
coefficients have a special Huffman table that is "two-dimensional"
in that one code specifies a run-length of zeros and the non-zero
value that ended the run. Also, the motion vectors and the DC
DCT components are DPCM (subtracted from the last one) coded.
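Here is a small Python sketch (mine, not the normative algorithm) of the
transform/quantize/run-length step for a single 8x8 block:

  import numpy as np

  N = 8
  # Orthonormal 8x8 DCT-II basis matrix; coeffs = C @ block @ C.T
  C = np.array([[np.sqrt((1 if k == 0 else 2) / N) *
                 np.cos(np.pi * (2 * n + 1) * k / (2 * N))
                 for n in range(N)] for k in range(N)])

  def zigzag(n=8):
      """Block positions in zig-zag order (low to high spatial frequency)."""
      return sorted(((r, c) for r in range(n) for c in range(n)),
                    key=lambda rc: (rc[0] + rc[1],
                                    rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

  block = np.random.randint(-128, 128, (N, N)).astype(float)   # pixels or a difference block
  coeffs = C @ block @ C.T                     # forward DCT
  q = 16                                       # quantizer step; can change per macroblock
  quant = np.round(coeffs / q).astype(int)     # dropping low-order bits; many become zero

  # The AC coefficients go out as (run of zeros, nonzero value) pairs, each of
  # which would get one Huffman code; the DC term and motion vectors are DPCM coded.
  ac = [quant[r, c] for r, c in zigzag()][1:]  # skip the DC coefficient
  pairs, run = [], 0
  for v in ac:
      if v == 0:
          run += 1
      else:
          pairs.append((run, v))
          run = 0
  print(quant[0, 0], pairs[:8])
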
Q. So is each frame predicted from the last frame?
A. No. The scheme is a little more complicated than that. There are
three types of coded frames. There are "I" or intra frames. They
are simply a frame coded as a still image, not using any past
history. You have to start somewhere. Then there are "P" or
predicted frames. They are predicted from the most recently
reconstructed I or P frame. (I'm describing this from the point
of view of the decompressor.) Each macroblock in a P frame can
either come with a vector and difference DCT coefficients for a
close match in the last I or P, or it can just be "intra" coded
(like in the I frames) if there was no good match.
Lastly, there are "B" or bidirectional frames. They are predicted
from the closest two I or P frames, one in the past and one in the
future. You search for matching blocks in those frames, and try
three different things to see which works best. (Now I've switched to
the point of view of the compressor, just to confuse you.) You try using
the forward vector, the backward vector, and you try averaging the
two blocks from the future and past frames, and subtracting that from
the block being coded. If none of those work well, you can intra-
code the block.
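Here is roughly what that mode decision looks like in code (my sketch,
with a made-up threshold; a real encoder also weighs how many bits each
choice would cost):

  import numpy as np

  INTRA_THRESHOLD = 16 * 16 * 24     # hypothetical "no good match" cutoff, not from the standard

  def sad(a, b):
      """Sum of absolute differences between two 16x16 blocks."""
      return int(np.abs(a.astype(int) - b.astype(int)).sum())

  def choose_b_mode(current, fwd_match, bwd_match):
      """current: macroblock being coded; fwd_match/bwd_match: best matching
      16x16 blocks found in the past and future reference frames."""
      candidates = {
          "forward":  fwd_match.astype(int),
          "backward": bwd_match.astype(int),
          "average":  (fwd_match.astype(int) + bwd_match.astype(int)) // 2,
      }
      mode, pred = min(candidates.items(), key=lambda kv: sad(current, kv[1]))
      if sad(current, pred) > INTRA_THRESHOLD:
          return "intra", current              # no good match: code it like an I block
      return mode, current.astype(int) - pred  # the difference goes on to the DCT stage

  mb = np.random.randint(0, 256, (16, 16))
  print(choose_b_mode(mb, mb, mb)[0])          # "forward" here, since every candidate matches exactly
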
The sequence of decoded frames usually goes like:
IBBPBBPBBPBBIBBPBBPB...
where there are 12 frames from I to I (for the US and Japan, anyway).
This is based on a random access requirement that you need a
starting point at least once every 0.4 seconds or so. The ratio
of P's to B's is based on experience.
Of course, for the decoder to work, you have to send that first
P *before* the first two B's, so the compressed data stream ends
up looking like:
0xx312645...
where those are frame numbers. xx might be nothing (if this is
the true starting point), or it might be the B's of frames -2 and
-1 if we're in the middle of the stream somewhere.
You have to decode the I, then decode the P, keep both of those
in memory, and then decode the two B's. You probably display the
I while you're decoding the P, and display the B's as you're
decoding them, and then display the P as you're decoding the next
P, and so on.
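In code, the reordering from display order into the order the frames
actually travel in looks something like this (my sketch of the pattern
above, nothing normative):

  def frame_type(n):
      """I every 12 frames, P every 3rd frame otherwise, B in between."""
      if n % 12 == 0:
          return "I"
      return "P" if n % 3 == 0 else "B"

  def coded_order(num_frames):
      """Each B is held back until the later I/P it depends on has been sent."""
      out, pending_b = [], []
      for n in range(num_frames):
          if frame_type(n) == "B":
              pending_b.append(n)
          else:                        # an I or P flushes the B's that precede it
              out.append(n)
              out.extend(pending_b)
              pending_b = []
      return out

  print([f"{n}{frame_type(n)}" for n in range(8)])
  # ['0I', '1B', '2B', '3P', '4B', '5B', '6P', '7B']
  print(coded_order(8))                # [0, 3, 1, 2, 6, 4, 5] -- the 0 3 1 2 6 4 5 above
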
Q. You've got to be kidding.
A. No, really!
Q. Hmm. Where did they get 352x240?
A. That derives from the CCIR-601 digital television standard which
is used by professional digital video equipment. It is (in the US)
720 by 243 by 60 fields (not frames) per second, where the fields
are interlaced when displayed. (It is important to note though
that fields are actually acquired and displayed a 60th of a second
apart.) The chrominance channels are 360 by 243 by 60 fields a
second, again interlaced. This degree of chrominance decimation
(2:1 in the horizontal direction) is called 4:2:2. The source
input format for MPEG I, called SIF, is CCIR-601 decimated by 2:1
in the horizontal direction, 2:1 in the time direction, and an
additional 2:1 in the chrominance vertical direction. And some
lines are cut off to make sure things divide by 8 or 16 where
needed.
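Worked out with the US numbers (my arithmetic, just following the text):

  # CCIR-601 (US): luminance 720 x 243 x 60 fields/s, chrominance 360 x 243 x 60 (4:2:2).
  width = 720 // 2                 # 360: 2:1 horizontal decimation
  rate  = 60 // 2                  # 30 frames/s: 2:1 temporal decimation (drop one field)
  lines = 240                      # one field's 243 lines, trimmed so they divide by 16
  width = (width // 16) * 16       # 352: likewise trim the width

  chroma_width = (360 // 2 // 16) * 16   # 176
  chroma_lines = lines // 2              # 120: the extra 2:1 vertical chrominance decimation

  print(width, lines, rate, chroma_width, chroma_lines)   # 352 240 30 176 120
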
Q. What if I'm in Europe?
A. For 50 Hz display standards (PAL, SECAM) change the number of lines
in a field from 243 or 240 to 288, and change the display rate to
50 fields/s or 25 frames/s. Similarly, change the 120 lines in
the decimated chrominance channels to 144 lines. Since 288*50 is
exactly equal to 240*60, the two formats have the same source data
rate.
Q. You didn't mention anything about the audio compression.
A. Oh, right. Well, I don't know as much about the audio compression.
Basically they use very carefully developed psychoacoustic models
derived from experiments with the best obtainable listeners to
pick out pieces of the sound that you can't hear. There are what
are called "masking" effects where, for example, a large component
at one frequency will prevent you from hearing lower energy parts
at nearby frequencies, where the relative energy vs. frequency
that is masked is described by some empirical curve. There are
similar temporal masking effects, as well as some more complicated
interactions where a temporal effect can unmask a frequency, and
vice-versa.
The sound is broken up into spectral chunks with a hybrid scheme
that combines sine transforms with subband transforms, and the
psychoacoustic model is written in terms of those chunks. Whatever
can be removed or reduced in precision is, and the remainder is
sent. It's a little more complicated than that, since the bits
have to be allocated across the bands. And, of course, what is
sent is entropy coded.
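A toy illustration of the masking idea (mine; the real MPEG audio model
is far more elaborate and works on many more components):

  import numpy as np

  bands = np.arange(32)                       # pretend the spectrum is 32 subbands
  level_db = np.full(32, 20.0)                # quiet background everywhere...
  level_db[10] = 80.0                         # ...plus one loud component in band 10

  quiet_threshold_db = 25.0                   # made-up threshold of hearing in silence
  # Crude spreading: the loud component masks nearby bands, less so farther away.
  masking_db = 80.0 - 10.0 - 6.0 * np.abs(bands - 10)
  threshold_db = np.maximum(quiet_threshold_db, masking_db)

  audible = level_db > threshold_db           # only these bands get any bits at all
  print(bands[audible])                       # [10]: nothing else is worth sending here
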
Q. So how much does it compress?
A. As I mentioned before, audio CD data rates are about 1.5 Mbits/s.
You can compress the same stereo program down to 256 Kbits/s with
no loss in discernible quality. (So they say. For the most part
it's true, but every once in a while a weird thing might happen
that you'll notice. However the effect is very small, and it takes
a listener trained to notice these particular types of effects.)
That's about 6:1 compression. So, a CD MPEG I stream would have
about 1.25 Mbits/s left for video. The number I usually see though
is 1.15 Mbits/s (maybe you need the rest for the system data
stream). You can then calculate the video compression ratio from
the numbers here to be about 26:1. If you step back and think
about that, it's little short of a miracle. Of course, it's lossy
compression, but it can be pretty hard sometimes to see the loss,
if you're comparing the SIF original to the SIF decompressed. There
is, however, a very noticeable loss if you're coming from CCIR-601
and have to decimate to SIF, but that's another matter. I'm not
counting that in the 26:1.
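Here is the arithmetic behind those ratios (mine, from the figures
quoted above):

  cd_rate     = 1.5e6                  # ~1.5 Mbits/s of CD stereo audio
  audio_coded = 256e3                  # 256 Kbits/s compressed
  print(cd_rate / audio_coded)         # ~5.9, i.e. "about 6:1"

  video_budget = 1.15e6                # bits/s usually quoted for the video stream
  # Uncompressed SIF source: 352x240 Y plus two 176x120 chroma planes, 8 bits each, 30 frames/s.
  sif_rate = (352 * 240 + 2 * 176 * 120) * 8 * 30      # ~30.4 Mbits/s
  print(sif_rate / video_budget)       # ~26.4, i.e. "about 26:1"
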
The standard also provides for other bit rates ranging from 32 Kbits/s
for a single channel, up to 448 Kbits/s for stereo.
Q. What's phase II?
A. As I said, there is a considerable loss of quality in going from
CCIR-601 to SIF resolution. For entertainment video, it's simply
not acceptable. You want to use more bits and code all or almost
all the CCIR-601 data. From subjective testing at the Japan
meeting in November 1991, it seems that 4 Mbits/s can give very
good quality compared to the original CCIR-601 material. The
objective of phase II is to define a bit stream optimized for these
resolutions and bit rates.
Q. Why not just scale up what you're doing with MPEG I?
A. The main difficulty is the interlacing. The simplest way to extend
MPEG I to interlaced material is to put the fields together into
frames (720x486x30/s). This results in bad motion artifacts that
stem from the fact that moving objects are in different places
in the two fields, and so don't line up in the frames. Compressing
and decompressing without taking that into account somehow tends to
muddle the objects in the two different fields.
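For reference, the field-weaving step itself is trivial (my sketch); the
trouble is everything downstream of it:

  import numpy as np

  def weave(top_field, bottom_field):
      """Interleave two 243-line fields into one 486-line frame."""
      lines, width = top_field.shape
      frame = np.empty((2 * lines, width), dtype=top_field.dtype)
      frame[0::2] = top_field          # even frame lines from one field
      frame[1::2] = bottom_field       # odd frame lines from the other, captured 1/60 s later
      return frame

  # Anything that moved between the two field times ends up with serrated
  # edges in the woven frame, which is what trips up a frame-based coder.
  top    = np.zeros((243, 720), dtype=np.uint8)
  bottom = np.zeros((243, 720), dtype=np.uint8)
  print(weave(top, bottom).shape)      # (486, 720)
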
The other thing you might try is to code the even and odd field
streams separately. This avoids the motion artifacts, but as you
might imagine, doesn't get very good compression since you are not
using the redundancy between the even and odd fields where there
is not much motion (which is typically most of the image).
Or you can code it as a single stream of fields. Or you can
interpolate lines. Or, etc. etc. There are many things you can
try, and the point of MPEG II is to figure out what works well.
MPEG II is not limited to considering only derivations of MPEG I.
There were several non-MPEG I-like schemes in the competition in
November, and some aspects of those algorithms may or may not
make it into the final standard for entertainment video compression.
Q. So what works?
A. Basically, derivations of MPEG I worked quite well, with one that
used wavelet subband coding instead of DCT's that also worked very
well. Also among the worked-very-well's was a scheme that did not
use B frames at all, just I and P's. All of them, except maybe one,
did some sort of adaptive frame/field coding, where a decision is
made on a macroblock basis as to whether to code that one as one
frame macroblock or as two field macroblocks. Some other aspects
are how to code I-frames--some suggest predicting the even field
from the odd field. Or you can predict evens from evens and odds
or odds from evens and odds or any field from any other field, etc.
Q. So what works?
A. Ok, we're not really sure what works best yet. The next step is
to define a "test model" to start from, that incorporates most of
the salient features of the worked-very-well proposals in a
simple way. Then experiments will be done on that test model,
making a mod at a time, and seeing what makes it better and what
makes it worse. Example experiments are, B's or no B's, DCT vs.
wavelets, various field prediction modes, etc. The requirements,
such as implementation cost, quality, random access, etc. will all
feed into this process as well.
Q. When will all this be finished?
A. I don't know. I'd hope it will be about a year or less.
Q. How do I join MPEG?
A. You don't join MPEG. You have to participate in ISO as part of a
national delegation. How you get to be part of the national
delegation is up to each nation. I only know the U.S., where you
have to attend the corresponding ANSI meetings to be able to
attend the ISO meetings. Your company or institution has to be
willing to sink some bucks into travel since, naturally, these
meetings are held all over the world. (For example, Paris,
Santa Clara, Kurihama Japan, Singapore, Haifa Israel, Rio de
Janeiro, London, etc.)
Q. Well, then how do I get the documents, like the MPEG I draft?
A. MPEG is a draft ISO standard. Its exact name is ISO CD 11172.
The draft consists of three parts: System, Video, and Audio. The
System part (11172-1) deals with synchronization and multiplexing
of audio-visual information, while the Video (11172-2) and Audio
part (11172-3) address the video and the audio compression techniques
respectively.
You may order it from your national standards body (e.g. ANSI in
the USA) or buy it from companies like
OMNICOM
phone +44 438 742424
FAX +44 438 740154