Currently, in the context of the HTTP protocol, two approaches have been proposed to enable the caching of media fragments. The first proposal is based on HTTP byte ranges in range requests/responses, while the second is based on the definition of new HTTP range units able to describe the axes of media fragments.
Using HTTP byte ranges to request media fragments enables existing HTTP proxies and caches to support the caching of media fragments out of the box. This approach is possible if a "4-way handshake" protocol is applied (see also Use_Cases_Discussion#Media_Delivery_UC, HTTP_implementation, and the 4-way handshake drawing). However, this method does not deliver complete resources, it generates an unbounded number of secondary resources to carry the control sections of the transmitted fragments, and extra care is needed when fetching different parts to avoid mixing data from a resource that has changed in the meantime. These secondary resources, which contain the control sections of the fragments retrieved from the original resource, need to be known by _all_ clients, which carries a significant implementation cost, but has no impact on caches.
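As an illustration, the handshake amounts to two request/response pairs: the client first resolves the fragment to byte ranges via a control request, and then issues an ordinary byte-range request that any HTTP/1.1 cache can store. A minimal sketch; the index structure and all offsets below are hypothetical, not taken from any specification:

```python
def resolve_fragment(time_start, time_end, index):
    """Steps 1-2 of the handshake: map a temporal fragment to byte offsets
    using a server-side index of {start_time: byte_offset} entries
    (illustrative; assumes time_start lies within the indexed range)."""
    times = sorted(index)
    # choose the last indexed point at or before time_start ...
    first = max(t for t in times if t <= time_start)
    # ... and the first indexed point at or after time_end
    last = min((t for t in times if t >= time_end), default=times[-1])
    return index[first], index[last] - 1

def byte_range_request(first_byte, last_byte):
    """Steps 3-4: the follow-up request uses plain byte ranges, so any
    existing HTTP/1.1 cache can store and serve the response."""
    return {"Range": "bytes=%d-%d" % (first_byte, last_byte)}

# Hypothetical index: a random access point every 10 seconds.
index = {0.0: 0, 10.0: 50000, 20.0: 110000, 30.0: 180000}
start, end = resolve_fragment(12.0, 25.0, index)
print(byte_range_request(start, end))  # {'Range': 'bytes=50000-179999'}
```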
HTTP byte ranges can only be used to request media fragments if these media fragments can be expressed in terms of byte ranges. This restriction implies that media resources should fulfill the following conditions:
- The media fragments can be extracted in the compressed domain.
- No syntax element modifications in the bitstream are needed to perform the extraction.
Not all media formats will be compliant with these two conditions. Hence, we distinguish the following categories:
- The media resource meets the two conditions (i.e., fragments can be extracted in the compressed domain and no syntax element modifications are necessary). In this case, caching media fragments of such media resources is possible using HTTP byte ranges, because their media fragments are addressable in terms of byte ranges.
- Media fragments can be extracted in the compressed domain, but syntax element modifications are required. These media fragments are cacheable using HTTP byte ranges, provided that the syntax element modifications are confined to media headers applying to the whole media resource/fragment. In that case, those media headers can be sent to the client in the server's first response, i.e., the response to a request on a dedicated resource distinct from the byte-range content.
- Media fragments cannot be extracted in the compressed domain. In this case, transcoding operations are necessary to extract media fragments. Since these media fragments are not expressible in terms of byte ranges, it is not possible to cache them using HTTP byte ranges. Note that media formats which enable extracting fragments in the compressed domain, but are not compliant with category 2 (i.e., the required syntax element modifications are not limited to headers applying to the whole media resource), also belong to this category.
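The three categories can be summarized as a small decision rule; the helper below is purely illustrative and not part of any specification:

```python
def caching_category(compressed_domain_extraction, needs_syntax_mods,
                     mods_header_only=False):
    """Classify a media format into the three caching categories above
    (illustrative helper)."""
    if not compressed_domain_extraction:
        return 3  # transcoding needed: fragments are not byte-addressable
    if not needs_syntax_mods:
        return 1  # fragments are plain byte ranges, fully cacheable
    # syntax modifications: cacheable only if confined to global headers
    return 2 if mods_header_only else 3

print(caching_category(True, False))        # 1
print(caching_category(True, True, True))   # 2
print(caching_category(False, True))        # 3
```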
Fragment axis and compressed domain extraction
In order to get more insight into the classification of media formats (see further), we discuss for each fragment axis (i.e., track, temporal, spatial, and name) how extraction in the compressed domain could be realized (requiring syntax element modifications or not).
Whether tracks are supported or not depends on the container format. Since a container format only defines a syntax and does not introduce any compression, it is always possible to describe the structures of a container format. Hence, if a container format allows the encapsulation of multiple tracks, then it is possible to describe the tracks in terms of byte ranges. Examples of such container formats are Ogg and MP4. Note that it is possible that the tracks are multiplexed, implying that a description of one track consists of a list of byte ranges. Also note that the extraction of tracks (and fragments in general) from container formats often introduces the necessity of syntax element modifications in the headers.
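Since a multiplexed track maps to a list of byte ranges, a single multi-part Range header (as HTTP/1.1 allows) can request the whole track. A minimal sketch, with purely illustrative byte offsets:

```python
def track_range_header(byte_ranges):
    """Build a single HTTP Range header for a (possibly multiplexed) track
    described as a list of (first, last) byte ranges; HTTP/1.1 permits
    multiple byte-range specs in one header (illustrative)."""
    return {"Range": "bytes=" + ",".join("%d-%d" % (f, l)
                                         for f, l in byte_ranges)}

# Illustrative: an interleaved audio track occupying three chunks of the file.
print(track_range_header([(100, 499), (1200, 1799), (2600, 3099)]))
# {'Range': 'bytes=100-499,1200-1799,2600-3099'}
```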
Whether temporal fragments are supported depends in the first place on the coding format and, more specifically, on how the encoding parameters were set. For video coding formats, temporal fragments can be extracted if the video stream provides random access points (i.e., points that do not depend on previously encoded video data, typically corresponding to intra-coded frames) on a regular basis. The same holds true for audio coding formats, i.e., the audio stream needs to be accessible at a point where the decoder can start decoding without needing previously coded data.
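A temporal request must therefore be aligned to the nearest preceding random access point; a minimal sketch, with illustrative timestamps:

```python
def snap_to_random_access(t, random_access_points):
    """Return the latest random access point at or before time t, i.e. the
    position where a decoder can start without earlier data (illustrative)."""
    candidates = [p for p in random_access_points if p <= t]
    return max(candidates) if candidates else random_access_points[0]

# With a random access point every 2 seconds, a request for t=5.3s
# must start decoding at 4.0s.
print(snap_to_random_access(5.3, [0.0, 2.0, 4.0, 6.0]))  # 4.0
```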
Support for extracting spatial fragments in the compressed domain depends on the coding format. The coding format must allow spatial regions to be encoded independently of each other in order to support the extraction of these regions in the compressed domain. Note that there are currently two variants: region extraction and interactive region extraction. In the first case, the regions (i.e., Regions Of Interest, ROIs) are known at encoding time and coded independently of each other. In the second case, ROIs are not known at encoding time and can be chosen by a user agent. Here, the media resource is divided into a number of tiles, each encoded independently of the others. Subsequently, the tiles covering the desired region are extracted from the media resource.
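For the interactive case, the server or user agent has to determine which independently coded tiles intersect the requested region. A small sketch, assuming a uniform tile grid (all dimensions are illustrative):

```python
def tiles_for_region(x, y, w, h, tile_w, tile_h):
    """Return (column, row) indices of the tiles covering the requested
    region, for interactive ROI extraction on a uniform tile grid
    (illustrative)."""
    cols = range(x // tile_w, (x + w - 1) // tile_w + 1)
    rows = range(y // tile_h, (y + h - 1) // tile_h + 1)
    return [(c, r) for r in rows for c in cols]

# A 200x100 region at (150, 50) on a grid of 128x128 tiles spans 4 tiles.
print(tiles_for_region(150, 50, 200, 100, 128, 128))
# [(1, 0), (2, 0), (1, 1), (2, 1)]
```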
No coding format provides support for named fragments. Hence, we have to consider container formats for this feature. In general, if a container format allows the insertion of metadata describing the named fragments, then the container format supports named fragments. For example, you can include a CMML or TimedText description in an MP4 or Ogg container and interpret this description to extract fragments based on a name.
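Resolving a named fragment then amounts to a lookup in the container's metadata; a minimal sketch, with hypothetical clip names and times standing in for a parsed CMML or TimedText description:

```python
# Hypothetical clip table, as could be parsed from a CMML or TimedText
# track inside an Ogg or MP4 container (names and times are illustrative).
named_clips = {"intro": (0.0, 12.5), "interview": (12.5, 240.0)}

def resolve_name(name, clips):
    """Map a fragment name to its (start, end) time range, or None if the
    container metadata defines no such name."""
    return clips.get(name)

print(resolve_name("interview", named_clips))  # (12.5, 240.0)
```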
Media format classification
In order to get a view on which media formats belong to which category, an overview is provided for the media formats currently described in State_of_the_Art/Codecs and State_of_the_Art/Containers. In the following table, the numbers 1, 2, and 3 correspond to the three categories described in Section #Conditions. The entry 'n/a' indicates that the media format does not support a particular fragment axis.
|Media format||Track||Temporal||Spatial||Named||Comments|
|H.264/MPEG-4 AVC||n/a||1||2||n/a||Spatial fragment extraction is possible with Flexible Macroblock Ordering (FMO)|
|Motion JPEG2000||n/a||1||3||n/a||Spatial fragment extraction is possible in the compressed domain, but syntax element modifications are needed for every frame.|
|MOV||2||n/a||n/a||2||QTText provides named chapters|
|MP4||2||n/a||n/a||2||MPEG-4 TimedText provides named sections|
|3GP||2||n/a||n/a||2||3GPP TimedText provides named sections|
|MPEG-21 FF||2||n/a||n/a||2||MPEG-21 Digital Item Declaration provides named sections|
|OGG||2||n/a||n/a||2||CMML provides named anchor points|
|ASF||2||n/a||n/a||2||Marker objects provide named anchor points|
|FLV||2||n/a||n/a||2||Cue points provide named anchor points|
|RMFF||1 or 2(?)||n/a||n/a||?|
|TIFF||2||n/a||n/a||2||Can store multiple images (i.e., tracks) in one file, possibility to insert "private tags" (i.e., proprietary information)|
Next to the byte ranges approach, we could also consider defining new HTTP range units able to describe a particular fragment axis. For example, a time range unit could be defined, pointing to temporal fragments in terms of time. This approach has the following advantages and disadvantages over retrieving data through the HTTP byte ranges approach described above:
Advantages:
- Representation of fragments is independent of the byte addressing, implying that the fragment extraction method (i.e., compressed domain extraction or transcoding) does not introduce caching restrictions.
- The four-way handshake protocol can be avoided.
- No extra resources are created.
- There is no need to ensure that the resource did not change between fetches.
- Complete fragment resources are generated, including control sections.
Disadvantages:
- This approach will not work with the current infrastructure of HTTP proxies and caches, as unknown range units will be ignored and the responses will not be cached.
- Caches need more knowledge of the media format in order to join media fragments. To join two fragments, a cache needs an exact mapping between byte positions and timestamps for each fragment. Furthermore, if we want to keep the cache format-independent, such a mapping is not enough: we also need the byte positions of random access points and their corresponding timestamps (computed by the cache, or provided by the server). This way, a cache can determine which parts overlap when joining two fragments. Note that this kind of information could be modeled by a codec-independent resource description format. Of course, this only works when the media fragments can be extracted in the compressed domain; joining fragments that result from transcoding operations would require transcoders in the cache.
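To make the joining step concrete, here is a hypothetical sketch of a format-independent cache merging two cached temporal fragments using (timestamp, byte offset) pairs for random access points; the data structures are assumptions about what a codec-independent resource description could provide, not part of any specification:

```python
def join_fragments(frag_a, frag_b):
    """Each fragment is (start_time, end_time, points), where points is a
    list of (timestamp, byte_offset) random access points it contains.
    Assumes frag_a starts no later than frag_b. Returns the merged fragment,
    or None if the fragments neither overlap nor touch (illustrative)."""
    a0, a1, pa = frag_a
    b0, b1, pb = frag_b
    if b0 > a1:  # gap between the fragments: they cannot be joined
        return None
    # keep all points of a, and only the points of b past the end of a,
    # so the overlapping part is not stored twice
    merged = pa + [(t, off) for (t, off) in pb if t > a1]
    return (min(a0, b0), max(a1, b1), merged)

a = (0.0, 10.0, [(0.0, 0), (5.0, 4000)])
b = (8.0, 20.0, [(8.0, 7000), (12.0, 9000), (16.0, 12000)])
print(join_fragments(a, b))
# (0.0, 20.0, [(0.0, 0), (5.0, 4000), (12.0, 9000), (16.0, 12000)])
```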
Caching media fragments based on axis-specific range headers looks promising, but comes with an implementation cost: it requires 'specialized media caches' that have the necessary knowledge of the new HTTP range units to interpret and join media fragments.