MPEG-2 Transport Stream Explained

I have been looking into open source projects for broadcasting H.264 over MPEG-2 transport stream and have been thoroughly disappointed. The most promising project at the moment is libmpegts from Kieran Kunhya, which was made for the Open Broadcast Encoder project. Unfortunately, at the time of this writing, that multiplexer is immature and the licensing of the project is too restrictive for many projects. I have actively helped him wherever possible in interpreting the MPEG-2 and SMPTE specifications, which are cryptic at times. For an offline decoder, there is a project I wrote in the past here which is a transport stream multiplexer with support for MPEG-2 video as well as a few audio types. I haven't worked on it for a while, but eventually I hope to do something more with it.

I'm going to put some effort into documenting MPEG-2 transport stream from the perspective of someone who would need to implement it. There is some documentation out there, but frankly, the nature of this documentation is also cryptic at times.

For the purpose of this document, I'll start by defining the needs of a multiplexer. Be aware that this document covers the broadcast specification of transport stream, not the Blu-ray spec. There are some differences, but since I have never concerned myself with producing Blu-ray, I can't point them out in detail.

Fixed Packet Size

TS packets are 188 bytes long. Yes, they can be longer, but any code which can read and handle those 188 bytes will work on the larger packet sizes as well. The 192 byte packets used for Blu-ray are just the 188 byte packet with a 4 byte timecode prepended to each packet. The 208 byte packet is simply the 188 byte packet with 20 bytes of Reed-Solomon error correction appended to the packet. Reed-Solomon is less common now than it was in the old days since TS is transmitted mostly over Ethernet via RTP. In the IP context, it is more common to use RTP forward error correction. That tends to add more latency to the stream, but is more useful in a "lossy context".
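As a rough illustration of the point about larger packet sizes, here is a minimal Python sketch that reduces 192 and 208 byte framing to plain 188 byte packets. The function name strip_to_188 and the FRAMINGS table are made up for this example, and the layouts are simply the ones described above (4 bytes in front for 192, 20 bytes behind for 208). It assumes the input already starts on a packet boundary.

```python
SYNC_BYTE = 0x47          # every TS packet starts with this byte
TS_PACKET_SIZE = 188

# framed size -> number of extra bytes in front of the 188 byte packet
# (layouts as described in the text; the table itself is hypothetical)
FRAMINGS = {
    188: 0,   # plain transport stream
    192: 4,   # 4 byte timecode prepended
    208: 0,   # 20 bytes of Reed-Solomon data appended
}

def strip_to_188(data: bytes, framed_size: int) -> list:
    """Return the bare 188 byte packets found in `data`.

    Assumes `data` begins exactly on a framed packet boundary.
    """
    lead = FRAMINGS[framed_size]
    packets = []
    for offset in range(0, len(data) - framed_size + 1, framed_size):
        start = offset + lead
        packet = data[start:start + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            raise ValueError(f"lost sync at byte offset {start}")
        packets.append(packet)
    return packets
```

Everything downstream of a function like this only ever needs to understand the 188 byte form.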

The 188 byte packet size of transport stream was chosen for an actual reason, not just to annoy people. ATM was the dominant digital infrastructure technology at the time transport stream was being designed. At that time, the idea of bundles of 2000 fibers across the Atlantic wasn't realistic. Companies couldn't just call Cable and Wireless and say "I'd like to rent one of the fibers in your cross-Atlantic bundle". Instead, the fiber was time shared and ATM was perfect for this since it was an insanely low latency switched network design. From a protocol perspective, it was insanely simple (bordering on stupid) and provided a means for guaranteeing bandwidth within a connection to whoever was buying from that fiber. So a single 655 megabit per second pair of fibers could provide 600 customers a guaranteed 1 megabit of bandwidth (there's about 10% header overhead).

With the boring history finished, the payload of an ATM "cell" is 48 bytes in length. Four times that is 192 bytes. The missing 4 bytes are usually explained by ATM Adaptation Layer 1 (AAL1), which takes one byte of each cell's payload for its own header and leaves 47 bytes for data, so four cells carry exactly 188 bytes. This is how we got 188 bytes for a transport stream packet. It was small enough to be low latency and big enough that it wasn't completely wasting bandwidth on packet headers.

Constant Bit Rate

An MPEG-2 transport stream in a broadcast sense is constant bit rate. In fact, it is so constant that most broadcast networks will start generating alarms on their receivers if there are even minor deviations in the bit rate of the stream. It will become clear why this is the case in the next point.

Transport stream IS the protocol/multiplex

Transport stream IS the multiplex. This is unlike IP based networks, where you could create a socket for video, another for audio, another for teletext etc… Think of transport stream as a ride at Disney where there is a long track with a ring of cars connected end to end throughout the entire track in a loop. People step into the moving cars when they get on the ride and they step off at the end of the ride. The cars never stop moving and they always move at precisely the same speed. Each packet in transport stream is one of these cars. A piece of video information hops onto a car at the start of the ride and hops off at the end of the ride. If there's no one there to get on the next car, the cars keep passing until the next passenger comes along. Every X number of cars, a guy with a clipboard to keep track of the groups hops on for the ride. He's carrying "program, stream and network" information which defines what data is found in this multiplex.

In the case of transport stream over IP, it is becoming more common to "stop the ride if there's no passengers", that is, to not bother with NULL packets to pad the stream out to a constant bit rate, since there's no real benefit in IP to constantly spewing data over the network. In the case of ATM and DVB-ASI (to be explained later… maybe) the packets will be transmitted no matter what; they have to be. Filling the empty slots with null packets makes it so there aren't a bunch of blank or random packets on the line. They at least have a header to say they should be ignored. Also, this allows the receiver to synchronize its clock with the transmitter.
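To show what "a header to say they should be ignored" looks like, here is a small Python sketch that builds a null packet (PID 0x1FFF) and pads a group of packets out to a fixed number of slots. The helper names null_packet, pad_to_cbr and packets_per_second are invented for this example. Appending all the nulls at the end is a simplification; a real multiplexer would interleave them with the payload packets according to its schedule.

```python
TS_PACKET_SIZE = 188
NULL_PID = 0x1FFF  # PID reserved for null (stuffing) packets

def null_packet() -> bytes:
    """Build one 188 byte null packet."""
    header = bytes([
        0x47,                    # sync byte
        (NULL_PID >> 8) & 0x1F,  # no error, no payload start, PID high bits
        NULL_PID & 0xFF,         # PID low bits
        0x10,                    # not scrambled, payload only, counter 0
    ])
    # The payload of a null packet is ignored; 0xFF filler is an assumption.
    return header + b"\xff" * (TS_PACKET_SIZE - len(header))

def pad_to_cbr(packets, slots):
    """Fill `slots` packet positions, using null packets for the gaps."""
    if len(packets) > slots:
        raise ValueError("more packets than slots in this interval")
    return list(packets) + [null_packet()] * (slots - len(packets))

def packets_per_second(bitrate_bps: float) -> float:
    """How many 188 byte packets a given mux bit rate carries per second."""
    return bitrate_bps / (TS_PACKET_SIZE * 8)
```

For example, a 15.04 megabit per second multiplex works out to 10,000 packet slots every second, whether or not there is anything to put in them.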

Transport stream packets can and do exist perfectly well without any other network topology. So long as there is a means of transmitting the 188 byte packets from the transmitter to the receiver, everything needed is included in the transport stream, from timing to subtitles.

There is more to MPEG transport than just Transport Stream

Many of the multiplexers out there (such as the one included in VideoLAN Client and the mpeg2mux component of GStreamer) have a bad habit of just spewing packets onto the network as they become available. This is considered bad behavior and should be avoided. Audio is generally encoded using constant bit rate streams, so audio for a given interval will always use the same number of bytes as audio from another interval. Video, on the other hand, almost never does. MPEG-2 video and H.264 video, while they can be massaged into a constant bit rate format, tend to have a great deal of variance. A B frame from a sequence that started off black and has been nothing but black should only take a few bytes to encode, but a high speed chase scene from "The Fast and the Furious" will probably consume as much bandwidth as the encoder is allowed to use.

A subject which is always a little hard for software developers to get used to is that memory isn't free. It costs money. The crummy set top boxes distributed by cable providers tend to use single chip solutions for almost the entire device. Sometimes the memory restrictions are so great that the protocol itself has to be massaged to make sure that there is enough memory to decode the picture at just the right time. It is entirely possible that a decoder doesn't even have enough RAM to store the compressed image of a single frame. A perfect transport stream multiplexer takes these rules into consideration and makes sure to output the data at a rate so that the decoder can start decoding the video pictures before the entire frame has even arrived.

Transport Stream defines a mechanism called the T-STD Buffer model which specifies precisely how a minimum implementation of the receiver buffer should perform. This means that the multiplexer can know precisely at any given time (relative to the stream which also drives the clock on the receiver) how many bytes are available for storing more received data before decoding it.
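The idea of tracking the receiver's buffer from the multiplexer side can be sketched in a few lines of Python. This is a drastically simplified, single-buffer stand-in and not the real T-STD, which chains several buffers per stream with rules of its own. The function check_schedule and all its parameters are hypothetical: bytes arrive at a constant rate, and each frame is pulled out of the buffer in one go at its decode time.

```python
def check_schedule(frame_sizes, frame_interval_ms, bytes_per_ms,
                   buffer_size, startup_delay_ms):
    """Check a delivery schedule against a single receiver buffer.

    frame_sizes        compressed size of each frame in bytes, decode order
    frame_interval_ms  time between decode instants
    bytes_per_ms       constant arrival rate of this stream's bytes
    buffer_size        receiver buffer capacity in bytes
    startup_delay_ms   time from first byte arriving to first decode

    Returns "ok", or a short description of the first problem found.
    """
    total = sum(frame_sizes)
    removed = 0
    for i, size in enumerate(frame_sizes):
        t = startup_delay_ms + i * frame_interval_ms
        # Bytes delivered so far; delivery stops once everything is sent.
        arrived = min(total, bytes_per_ms * t)
        # Occupancy peaks just before a frame is removed.
        occupancy = arrived - removed
        if occupancy > buffer_size:
            return f"overflow before frame {i}"
        if occupancy < size:
            return f"underflow at frame {i}"
        removed += size
    return "ok"
```

With frames of 1000, 200 and 200 bytes, 40 ms between frames, 25 bytes arriving per millisecond and a 1500 byte buffer, a 40 ms startup delay passes the check, while a 20 ms startup delay underflows on the very first frame because only half of it has arrived.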

I won't detail the insanely low buffer spec as it has become irrelevant, but I will detail the scenario where the receiver may have, at most, enough room to store a single frame of video in the input buffer at a given time.

The decoder decodes data at the time specified by the multiplexer.

MPEG video compression schemes output images not in the order they should be displayed, but instead in the order in which they need to be decoded. This is required for "bi-directionally predicted" frames, which require image data from frames that will display later in order to produce the current frame. Therefore it is necessary to decode the future frame before decoding the current frame. Otherwise the data won't be available for the current frame. The multiplexer is responsible for telling the decoder when it should decode frames. So, each frame that is transmitted contains both a "Presentation Time Stamp (PTS)", the time at which the frame should be displayed, as well as a "Decoding Time Stamp (DTS)", the time at which the data for the frame should be fed to the video decoder. This data also exists in the video streams themselves, but neither in a way which corresponds to other streams (such as audio) nor in a way which is consistent across all media codecs, and therefore it would be very difficult to parse for every media type which might be received.
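A tiny worked example may help here. The Python sketch below takes four frames in the order an encoder might emit them and stamps each with a DTS and a PTS in 90 kHz ticks. The frame rate of 25 frames per second, the one frame reorder delay and the function name stamp are all assumptions made for the illustration, and wrap-around of the timestamps is ignored.

```python
TICKS_PER_FRAME = 90000 // 25  # 3600 ticks per frame at an assumed 25 fps

# (frame type, display index) in the order the encoder emits the frames.
# The P frame is sent early because both B frames need it as a reference.
DECODE_ORDER = [("I", 0), ("P", 3), ("B", 1), ("B", 2)]

def stamp(frames, reorder_delay=1):
    """Assign DTS and PTS (in 90 kHz ticks) to frames given in decode order.

    reorder_delay is how many frame periods presentation lags decoding,
    so that no frame has to be shown before it has been decoded.
    """
    stamped = []
    for decode_index, (kind, display_index) in enumerate(frames):
        stamped.append({
            "type": kind,
            "dts": decode_index * TICKS_PER_FRAME,
            "pts": (display_index + reorder_delay) * TICKS_PER_FRAME,
        })
    return stamped

frames = stamp(DECODE_ORDER)
# Sorting by PTS recovers the display order: I, B, B, P.
display_order = [f["type"] for f in sorted(frames, key=lambda f: f["pts"])]
```

Note how the P frame is decoded second (DTS 3600) but presented last (PTS 14400), while each B frame is presented at the same instant it is decoded.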

MPEG was designed by analog freaks!

While it's often possible for a programmer to learn enough about digital electronics to start working on VHDL or Verilog projects, there is almost no connection between the mindset of a programmer and the mindset of a radio electronics engineer. When reading the ISO/IEC 13818-1:2000 specification (which covers transport stream) it is very easy to get lost in piles of criteria such as tolerances and just senseless math. This information is probably useful for defining tests for certification equipment, but is borderline useless and often counter-productive for someone who would implement this protocol. Be warned!

Also, keep in mind that the values in transport stream are of insanely stupid sizes, as if whoever chose them was counting the number of transistors needed to implement each value. A stream identifier is 13 bits long. A clock reference is 33 bits long. The clock reference itself is timed at 27 MHz. The synchronization clock cycle is 90 kHz. Blah blah blah. There are reasons for some of these numbers, but it boils down to things like "We can transmit a serial signal at 270 MHz using 8b/10b encoding, meaning that precisely 27 million bytes of data can be moved over this cable per second." etc…
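To give a feel for what these odd sizes mean in code, here is a Python sketch that pulls the 13 bit stream identifier (the PID) and the clock reference out of a single 188 byte packet. The function names parse_pid and parse_pcr are made up for this example. The clock reference is carried as a 33 bit base counting at 90 kHz plus a 9 bit extension counting at 27 MHz, and the sketch folds the two into one 27 MHz tick count.

```python
def parse_pid(packet: bytes) -> int:
    """Return the 13 bit PID: low 5 bits of byte 1 plus all of byte 2."""
    return ((packet[1] & 0x1F) << 8) | packet[2]

def parse_pcr(packet: bytes):
    """Return the clock reference in 27 MHz ticks, or None if absent."""
    adaptation_field_control = (packet[3] >> 4) & 0x3
    if adaptation_field_control not in (2, 3):
        return None                 # packet has no adaptation field
    if packet[4] < 7:
        return None                 # adaptation field too short for a PCR
    if not packet[5] & 0x10:
        return None                 # PCR flag not set
    b = packet[6:12]                # 33 bit base, 6 reserved bits, 9 bit ext
    base = (b[0] << 25) | (b[1] << 17) | (b[2] << 9) | (b[3] << 1) | (b[4] >> 7)
    extension = ((b[4] & 0x01) << 8) | b[5]
    return base * 300 + extension   # 27 MHz / 90 kHz = 300
```

A packet whose base is 90000 and whose extension is 150 therefore represents 27,000,150 ticks, which is a hair over one second on the 27 MHz clock.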