
Please explain how to compute initial_presentation_delay from the bitstream #101

Open
ocrete opened this issue Sep 4, 2018 · 36 comments
Labels: editorial (if approved, it does not affect the technical aspects of the specification, e.g. typo change)

ocrete commented Sep 4, 2018

The current definition of the initial_presentation_delay is "interesting" if you want to build a decoder, but it's completely non-trivial to figure out what value to put there when doing a muxer. When trying to produce it in GStreamer, I'm completely lost. I noticed that neither ffmpeg nor libaom has any code that implements it.

Can we please, either:

  • Update libaom to give that value explicitly
  • Include a procedure/algorithm to calculate this value from the sequence header
@tomfinegan

> The current definition of the initial_presentation_delay is "interesting" if you want to build a decoder, but it's completely non-trivial to figure out what value to put there when doing a muxer. When trying to produce it in GStreamer, I'm completely lost. I noticed that neither ffmpeg nor libaom has any code that implements it.

I agree that the initial_presentation_delay_minus_one section is fairly complex, but I think the example included makes the necessary steps pretty clear. Can you elaborate on where you're getting lost?

> Update libaom to give that value explicitly

That could potentially be done, but this is not the appropriate issue tracker for such a feature request. Please file an issue in the AOM issue tracker: https://bugs.chromium.org/p/aomedia/issues/entry?template=Feature+Request.

Adding support to libaom for returning the value in terms of samples (aka Temporal Units) would not help when remuxing. I don't think adding it to libaom really solves the problem for muxers.

> Include a procedure/algorithm to calculate this value from the sequence header

It's not possible to calculate given only a sequence header OBU. You need the initial_presentation_delay_minus_one value to know the number of frames, and then the (de)muxer must calculate the value in samples by counting the number of frames in each Temporal Unit.

  • For Low Overhead Bitstream Format TUs, frame counts can be determined by counting the number of Frame and Frame Header OBUs in each Temporal Unit.
  • For Annex B TUs, it can be determined by counting the number of frame_unit_size values in each TU.

When initial_presentation_delay_minus_one is not present in the Sequence Header OBU, the value in frames is assumed to be 10. A (de)muxer would then follow the same procedure.
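
For concreteness, here is a rough sketch (Python; the helper names are invented, and nothing here is normative or from libaom) of that counting step for the Low Overhead Bitstream Format: walk the OBU headers, read the LEB128 obu_size field that this format requires on every OBU, and count Frame and Frame Header OBUs between Temporal Delimiters. Counting frame_unit_size values in an Annex B stream would be analogous.

```python
OBU_TEMPORAL_DELIMITER = 2
OBU_FRAME_HEADER = 3
OBU_FRAME = 6

def read_leb128(data, pos):
    """Decode an unsigned LEB128 value; return (value, new_pos)."""
    value = 0
    for i in range(8):
        byte = data[pos + i]
        value |= (byte & 0x7F) << (7 * i)
        if not (byte & 0x80):
            return value, pos + i + 1
    raise ValueError("leb128 too long")

def frames_per_temporal_unit(data):
    """Yield the frame count of each TU in a low-overhead AV1 stream."""
    pos, count, in_tu = 0, 0, False
    while pos < len(data):
        header = data[pos]
        obu_type = (header >> 3) & 0x0F
        has_extension = (header >> 2) & 1
        pos += 1 + has_extension            # skip obu_extension_header if any
        size, pos = read_leb128(data, pos)  # obu_size, mandatory in this format
        if obu_type == OBU_TEMPORAL_DELIMITER:
            if in_tu:
                yield count
            count, in_tu = 0, True
        elif obu_type in (OBU_FRAME, OBU_FRAME_HEADER):
            count += 1                      # redundant frame headers not counted
        pos += size                         # skip the OBU payload
    if in_tu:
        yield count
```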


robUx4 commented Sep 11, 2018

Technically the encoder should know this value if it fills the initial_display_delay_minus_1 field. It also knows how many frames it would pack in a single TU, which only happens in some cases with some time constraints (an encoder is probably never going to provide 2 S-frames 10 frames apart?).

If not provided by the encoder, it will be very tricky to fill the value properly without parsing the whole stream, so it will only be known a posteriori.

Maybe in practice it will always be initial_display_delay_minus_1 + 1?


robUx4 commented Sep 11, 2018

BTW, I don't see any mention of a default of 10 in the ISOBMFF document.


ocrete commented Sep 11, 2018

If I understand the example in the spec, this value is the same as initial_display_delay_minus_1 but counted in TUs instead of in frames?

Can we add something along the lines of this to the document:

To compute initial_presentation_delay_minus_one from a stream, one must:

  1. Iterate over the entire stream, count the number of frames in each TU, and take the maximum (let's call this max_frames_in_one_tu)
  2. Read the value of initial_display_delay_minus_1 from the sequence header (I assume there can't be more than one sequence header per track in ISOBMFF?)
  3. Apply the following formula: initial_presentation_delay_minus_one = (????) - 1;

Also, I wonder if it would make sense to extend the AV1 bitstream to add this in a metadata block or to extend the Sequence Header somehow. This value seems useful even outside of MP4 files, so having it as MP4-specific information sounds like a workaround for a design gap in the bitstream.


ocrete commented Sep 11, 2018

> Maybe in practice it will always be initial_display_delay_minus_1 + 1?

If that is true, maybe we can just drop this initial_presentation_delay from the MP4 header entirely?

@VFR-maniac

> When initial_presentation_delay_minus_one is not present in the Sequence Header OBU, the value in frames is assumed to be 10. A (de)muxer would then follow the same procedure.

The value 10 comes from BufferPoolMaxSize (=10)? But why 10 and not 9 instead? I think the decoder can't hold more than 10 frames, so the decoder can't hold more than 10 temporal units, so initial_presentation_delay_minus_one + 1 <= 10.

I'm also wondering how initial_presentation_delay_minus_one affects presentation delay on the presentation timeline. The AV1-in-ISOBMFF spec specifies no ctts box. This means that, in general, Decoding Time == Composition Time: the decoder takes a sample at the Decoding Time and outputs a decoded frame with the Composition Time (== Decoding Time) unless there is compositionToDTSShift > 0, so there is no delay in units of ISOBMFF samples (which here correspond to AV1 samples and Temporal Units). But initial_presentation_delay_minus_one says there may be a delay in units of ISOBMFF samples. It's really confusing. I'm reading the AV1 spec but I don't get why there is decoder delay in units of timestamped access units; it's just like the packed bitstream, a popular hack used for B-VOPs-in-AVI for VfW, where there is no delay to output a decoded frame at the decoder side. If there are no composition time offsets, it is strange that initial_presentation_delay_minus_one is present.

@cconcolato

FYI, I just filed a feature request to aomenc, see https://bugs.chromium.org/p/aomedia/issues/detail?id=2150


tomfinegan commented Sep 11, 2018

> Technically the encoder should know this value if it fills the initial_display_delay_minus_1 field. It also knows how many frames it would pack in a single TU, which only happens in some cases with some time constraints (an encoder is probably never going to provide 2 S-frames 10 frames apart?).

It doesn't matter if an encoder knows it or not when re-muxing AV1.

> If not provided by the encoder, it will be very tricky to fill the value properly without parsing the whole stream, so it will only be known a posteriori.

Why do you think you need the entire bitstream? You need enough TUs (samples) to count 10 frames.

> Maybe in practice it will always be initial_display_delay_minus_1 + 1?

Where is this calculation coming from?

> BTW, I don't see any mention of a default of 10 in the ISOBMFF document.

It's from the AV1 spec[1]. It's the value of BufferPoolMaxSize.

edit: forgot this link:
[1] https://aomediacodec.github.io/av1-spec/av1-spec.pdf#page=661

@tomfinegan

> The value 10 comes from BufferPoolMaxSize (=10)? But why 10 and not 9 instead? I think the decoder can't hold more than 10 frames, so the decoder can't hold more than 10 temporal units, so initial_presentation_delay_minus_one + 1 <= 10.

I was including the +1, but I should have been more clear. Sorry.


cconcolato commented Sep 11, 2018

> If not provided by the encoder, it will be very tricky to fill the value properly without parsing the whole stream, so it will only be known a posteriori.

I agree. It should be possible to determine the value a posteriori. I had started doing that, see https://github.com/cconcolato/av1_decoder_model. You could input an initial_display_delay and check if the bitstream was valid according to the decoder model. It has not been updated in a while though but if anyone is interested, feel free to suggest updates.

> Maybe in practice it will always be initial_display_delay_minus_1 + 1?

It depends on the number of alt-ref images but maybe.

> To compute initial_presentation_delay_minus_one from a stream, one must:

I don't think this algorithm works. Again, you have to run the decoder model and see what minimum initial_display_delay validates the model.

@cconcolato

> I'm also wondering how initial_presentation_delay_minus_one affects presentation delay on the presentation timeline. The AV1-in-ISOBMFF spec specifies no ctts box. This means that, in general, Decoding Time == Composition Time: the decoder takes a sample at the Decoding Time and outputs a decoded frame with the Composition Time (== Decoding Time) unless there is compositionToDTSShift > 0, so there is no delay in units of ISOBMFF samples (which here correspond to AV1 samples and Temporal Units).

As you mentioned, composition offsets are not used so you cannot use compositionToDTSShift > 0.

> But initial_presentation_delay_minus_one says there may be a delay in units of ISOBMFF samples. It's really confusing.

Sorry for that. There is no composition offset because if you feed TUs to a decoder it will produce as many output frames as input TUs and with the same presentation order as the decoding order. If you assume instantaneous decoding (as usual in ISOBMFF), this means CTS = DTS.

The initial_presentation_delay concept is introduced to cope with problems happening in real implementations. When your decoder operates at the decoding speed limit of a level, if you don't wait to fill some reference buffers before starting to display, you may experience smoothness issues. The delay tells you how long your player should wait. If no information is provided, an AV1 decoder should wait for 10 frames to be decoded, but for some bitstreams you may need less than that.
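
To illustrate how a player might use this hint (purely a sketch, not from the spec; decode() and present() are stand-in stubs for a real decoder and renderer), the idea is simply to pre-decode that many samples before presenting the first frame:

```python
from itertools import islice

def decode(sample):   # stub standing in for a real AV1 decoder call
    return sample

def present(frame):   # stub standing in for handing a frame to the renderer
    pass

def play(samples, initial_presentation_delay_minus_one=9):
    """Pre-decode samples before presenting, per the hint (default: 10)."""
    wait = initial_presentation_delay_minus_one + 1   # delay in samples (TUs)
    it = iter(samples)
    buffered = [decode(s) for s in islice(it, wait)]  # fill reference buffers
    for frame in buffered:                            # now presentation can
        present(frame)                                # proceed smoothly
    for sample in it:
        present(decode(sample))
```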

> I'm reading the AV1 spec but I don't get why there is decoder delay in units of timestamped access units,

I'm not sure what you mean by "in units of timestamped access unit". There is a delay, mostly because of 'show_frame = 0'. The delay at the elementary stream level is expressed in number of decoded frames. At the ISOBMFF level, it is expressed in number of decoded samples, because a player may not have access to the internals of a decoder to know how many frames were decoded when a TU is passed.

> If there are no composition time offsets, it is strange that initial_presentation_delay_minus_one is present.

Hope I clarified.


ocrete commented Sep 11, 2018

> I don't think this algorithm works. Again, you have to run the decoder model and see what minimum initial_display_delay validates the model.

Can you please give me some pseudo-code or algorithms to compute it? No theoretical decoders that can't fail please, just a real algorithm that a stupid programmer like myself can implement.

The part I don't understand is why counting 10 frames is enough. Does this delay only apply to the first 10 frames? What if there is a bigger grouping later, is that forbidden by the AV1 spec?

@cconcolato

> Can you please give me some pseudo-code or algorithms to compute it? No theoretical decoders that can't fail please, just a real algorithm that a stupid programmer like myself can implement.

Unfortunately, that has not been done yet ...

> The part I don't understand is why counting 10 frames is enough. Does this delay only apply to the first 10 frames? What if there is a bigger grouping later, is that forbidden by the AV1 spec?

That's the upper bound according to the AV1 spec. If you decode 10 frames before presenting the first one, you are guaranteed to be able to present the bitstream smoothly (if the decoder operates at the decoding speed, or faster, given by the level definition).


agrange commented Sep 11, 2018 via email


ocrete commented Sep 12, 2018

That all makes sense. The part that I don't get is why looking at the first 10 frames of a stream is enough. Isn't it possible to have 10 frames with show_frame=1, then the next 8 with show_frame=0, or 1000 shown frames before the first non-shown one? So to compute the value of the initial delay (i.e., the size of the dpb), we'd need to parse every frame in the stream, since the value in the sequence header is not useful for muxing. But Tom says that parsing the first 10 frames is enough?

@cconcolato

A muxer has to either trust the encoder to give the value or run the analysis on the entire stream.
10 is an upper bound on the value you will find.


tomfinegan commented Sep 12, 2018

> But Tom says that parsing the first 10 frames is enough?

I was referring to Temporal Units or samples, not frames, since samples can contain multiple frames.

What I said was that to calculate the value a (de)muxer could count the frames in samples that begin the stream. Since 10 is:

  • The maximum value in frames for initial_display_delay_minus_1 + 1.
  • The assumed default value for initial_display_delay_minus_1 when it is not present in the Sequence Header OBU.

A (de)muxer can calculate a value in samples by counting frames in Temporal Units. When the value is not present in the Sequence Header OBU a (de)muxer would use the value 10.

Whether it is present in the Sequence Header OBU or not, a muxer would count the frames in Temporal Units until it processes the TU where frames == initial_display_delay_minus_1 + 1, and then set initial_presentation_delay_minus_one = number of TUs - 1.

edit: added '+ 1' to the first bullet point
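
A literal transcription of this counting procedure might look like the following sketch (assuming a per-TU frame-count iterator such as the one sketched earlier in the thread; note that it is argued elsewhere in this thread that only running the decoder model yields the exact minimum):

```python
def initial_presentation_delay_minus_one(tu_frame_counts,
                                         initial_display_delay_minus_1=9):
    """Count TUs until the cumulative frame count reaches
    initial_display_delay_minus_1 + 1 (default 10 when the field is absent
    from the Sequence Header OBU), then return (number of TUs) - 1."""
    target = initial_display_delay_minus_1 + 1
    frames = 0
    for tu_index, count in enumerate(tu_frame_counts):
        frames += count
        if frames >= target:
            return tu_index   # number of TUs, minus one
    raise ValueError("stream ended before reaching the target frame count")
```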


robUx4 commented Sep 13, 2018

> A muxer has to either trust the encoder to give the value or run the analysis on the entire stream.
> 10 is an upper bound on the value you will find.

As @agrange noted, it could theoretically extend beyond 10.

I still think this information can be provided by the encoder. A posteriori in all cases, and a priori if it knows exactly how it can manage non-showable frames.

It's true that partially remuxing a file may change the value needed for smooth playback, if fewer are needed for that part (it can never be more). Does it make the remuxed file invalid if it claims a value for initial_presentation_delay_minus_one but it's actually less (and the goal of that value is to find out when it's less than 10)? Or should we either void the value/presence on remux or reparse all OBUs to get the proper value?

@lu-zero mentioned this issue Sep 13, 2018

agrange commented Sep 13, 2018 via email

@cconcolato

> As @agrange noted, it could theoretically extend beyond 10.

I don't think that's what @agrange said. AV1 limits the number of buffers to 10, so the 11th frame could not be decoded. @agrange can you clarify?

> I still think this information can be provided by the encoder.

I agree.

> A posteriori in all cases, and a priori if it knows exactly how it can manage non-showable frames.

Yes, but a posteriori it could also be done by another tool (muxer or else), though this requires running the decoder model.

> Does it make the remuxed file invalid if it claims a value for initial_presentation_delay_minus_one but it's actually less (and the goal of that value is to find out when it's less than 10)?

No. However, you have to be careful that if you splice two streams that have different values, you either have to use the larger one or not give the value.


agrange commented Sep 13, 2018 via email

@VFR-maniac

@cconcolato I still don't get it from your explanation.

First, clarify the definition of the composition time in AV1-in-ISOBMFF. The absence of the Composition Time to Sample Box does not mean the absence of the definition and/or the concept of the composition time in AV1-in-ISOBMFF. From your explanation, I can see there is no concept of the composition time.

Personally, I really dislike this indication of the delay outside the common time structure in the scope of ISOBMFF. Why not just add a Composition Time to Sample Box consisting of only one entry which indicates the presentation delay time as the sample_offset, instead?

I'm also wondering how to treat this when an edit list is applied. media_time=0 specifies that the presentation of the AV1 track starts from time=0 on the media timeline, but the presentation is delayed until the time of the (initial_presentation_delay_minus_one+1)-th AV1 sample?

@VFR-maniac

> I'm not sure what you mean by "in units of timestamped access unit". There is a delay, mostly because of 'show_frame = 0'. The delay at the elementary stream level is expressed in number of decoded frames. At the ISOBMFF level, it is expressed in number of decoded samples, because a player may not have access to the internals of a decoder to know how many frames were decoded when a TU is passed.

I don't get this part at all. I can understand there is a delay at the frame level. But I can't understand there being a delay at the TU level. As far as I understand, a TU is a gathering of frames delimited by timestamp which can be assigned to an output frame. This means the decoder takes a TU with timestamp T, then the decoder can output a shown frame with T without waiting for the next TU. The AV1 spec says "Each temporal unit must have exactly one shown frame." So I strongly think the decoder takes a TU, then the decoder can output a frame smoothly. Where am I wrong? Or, to output the first frame after TU0, could that frame depend on TU1 or a later TU?

@cconcolato

> a TU is a gathering of frames delimited by timestamp

Almost. I would say delimited by a temporal delimiter in the input bitstream, but they are associated with the same timestamp.

> which can be assigned to an output frame

Only one of the frames in the TU will produce an output frame.

> This means the decoder takes a TU with timestamp T, then the decoder can output a shown frame with T without waiting for the next TU.

Yes.

> I strongly think the decoder takes a TU, then the decoder can output a frame smoothly.

If you take out the word "smoothly", yes. Given a TU, a decoder can always output a frame.
The problem is that a decoder cannot always decode the TU in the time during which the previous frame has to be presented (assuming a fixed frame rate for simplification here), because the TU may contain multiple frames.
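
As a worked example of that timing constraint (all numbers assumed for illustration): with a 30 fps presentation and a decoder running exactly at the level's decode rate of 30 frames per second, a TU holding 3 frames takes 3 decode intervals to decode, but its shown frame is due only 1 presentation interval after the previous one:

```python
fps = 30.0
frames_in_tu = 3
decode_time = frames_in_tu / fps       # 0.100 s to decode the whole TU
deadline = 1 / fps                     # 0.033 s until its frame is due
stall = decode_time - deadline         # stall without pre-decoding
print(f"stall without pre-decoding: {stall:.3f} s")  # 0.067 s
```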

> Or, to output the first frame after TU0, could that frame depend on TU1 or a later TU?

No. A TU has no dependency on future TUs.


agrange commented Sep 13, 2018 via email

@cconcolato

> First, clarify the definition of the composition time in AV1-in-ISOBMFF. The absence of the Composition Time to Sample Box does not mean the absence of the definition and/or the concept of the composition time in AV1-in-ISOBMFF. From your explanation, I can see there is no concept of the composition time.

I'm not sure what the question is here.

> Personally, I really dislike this indication of the delay outside the common time structure in the scope of ISOBMFF. Why not just add a Composition Time to Sample Box consisting of only one entry which indicates the presentation delay time as the sample_offset, instead?

We could have used the ctts box (although not with a single entry) but it introduces lots of complexity (requires an edit list for AV-sync or negative CTS offsets ...). I strongly believe the chosen approach is simpler: players can ignore initial_presentation_delay and muxers are only required to put a value if it is correct, otherwise they can omit the value. initial_presentation_delay is only an indication/hint for players if they want to reduce the playback latency.

> I'm also wondering how to treat this when an edit list is applied. media_time=0 specifies that the presentation of the AV1 track starts from time=0 on the media timeline, but the presentation is delayed until the time of the (initial_presentation_delay_minus_one+1)-th AV1 sample?

The initial_presentation_delay does not affect composition or decode times. So there is no impact. Edit lists are applied as usual.


robUx4 commented Sep 14, 2018

> I interpreted Steve's comment to mean that, as per the example I provided, we can conceive of a GOP structure that requires more than 10 frame buffers. Whilst AV1 restricts signaling of the delay to 4 bits, so 16 frames, a decoder that only provides the minimal 10 frame buffers that AV1 mandates would not be able to decode the 11th frame until one of the 10 available frame buffers becomes free, presumably as the result of a display event. A decoder/application may choose to provide a larger number of frame buffers, 100 say, which would allow it to decode way beyond the 10th frame before displaying any frame. But a compliant bitstream cannot rely on that over-provisioning. And we are still able to signal a maximum delay of 16 frames.

I think I understand the nuance now. A compliant decoder should only cache 10 frames at most. So even if a TU contains 100 frames, the decoder will still only have at most 10 frames in its cache. So it can never be more than 10 (minus/plus 1 depending on how you count).


agrange commented Sep 14, 2018 via email


jeeb commented Sep 14, 2018

It is really unfortunate that this discussion spawned around/after v1 of the specification got "frozen", probably partially because people only start implementing something when they know it will not wildly change any more. But it is what it is.

This whole value seems like something that should be a header flag in the AV1 bitstream, a la max_decoder_latency/max_decoder_buffer_required, rather than something that should be in the container... You can always replicate the field in the container if you really want to (the HEVC-in-ISOBMFF specification writers seemed to think so), but if the flag appears for the first time at the container level and there's nothing at the bitstream level to read it from, then it becomes a parsing nightmare if you need to know the maximum reorder delay throughout the full stream to fill that value.

Also one would think that things like this could be handled on the SW side of the hwdec implementation, where it would just return "feed me more" until it can actually return the following coded image in PTS/CTS order. As I think this is how hwdec generally works for AVC/HEVC? Given that required header/initialization values are available to the actual decoder/parser that feeds to the hwdec implementation, of course.

> ... muxers are only required to put a value if it is correct, otherwise they can omit the value. initial_presentation_delay is only an indication/hint for players if they want to reduce the playback latency.

So do I understand it correctly that writing this value at all is 100% voluntary and that there is a boolean somewhere to mention if you could come up with a value for this field or not? Or do you mean writing the default (10 buffered frames) as "omit"?

@cconcolato

> So do I understand it correctly that writing this value at all is 100% voluntary and that there is a boolean somewhere to mention if you could come up with a value for this field or not?

Yes

@VFR-maniac

> If you take out the word "smoothly", yes. Given a TU, a decoder can always output a frame.
> The problem is that a decoder cannot always decode the TU in the time during which the previous frame has to be presented (assuming a fixed frame rate for simplification here), because the TU may contain multiple frames.

That is just a composition time offset at the TU where the decoder requires more TUs, isn't it?

> We could have used the ctts box (although not with a single entry) but it introduces lots of complexity (requires an edit list for AV-sync or negative CTS offsets ...). I strongly believe the chosen approach is simpler: players can ignore initial_presentation_delay and muxers are only required to put a value if it is correct, otherwise they can omit the value. initial_presentation_delay is only an indication/hint for players if they want to reduce the playback latency.

I don't think this approach makes the issue simpler. If you really want to avoid negative CTS offsets, it is enough for the spec to forbid them. Also, I don't think the edit list is a complex thing. You say it is a hint for players, but if players ignore it even when initial_presentation_delay is present, there is possibly jerkiness, isn't there? So I think it is not an ignorable thing, and it is a hint for almost all players. This is similar to a demuxer or player that doesn't know about the edit list and may introduce AV desync. The initial_presentation_delay only makes sense for AV1. To support AV1 in ISOBMFF, the muxer and demuxer need to support initial_presentation_delay in addition to the decoder initialization record as the minimum implementation. I believe that container file formats should hide the codec-specific properties as much as possible, so that any encapsulated codec can be treated in a common way. If initial_presentation_delay were defined in the ISOBMFF spec itself, I wouldn't say such unpleasant words. :(


robUx4 commented Sep 17, 2018

> Also one would think that things like this could be handled on the SW side of the hwdec implementation, where it would just return "feed me more" until it can actually return the following coded image in PTS/CTS order. As I think this is how hwdec generally works for AVC/HEVC? Given that required header/initialization values are available to the actual decoder/parser that feeds to the hwdec implementation, of course.

The issue here is that a TU (a Sample in ISOBMFF, a Block in Matroska) may contain more than one frame to decode, invisibly to the container. But it's not necessarily at the beginning of the stream. The Sequence Header OBU may have the information, but in frames, not TUs. And it doesn't say how many frames can be packed in a TU during the whole Sequence it describes. So in any case, we cannot get this information from the decoder.

@cconcolato

For context, a related issue in dav1d: https://code.videolan.org/videolan/dav1d/-/issues/406


tdaede commented Oct 3, 2022

Is it always safe to copy initial_display_delay from the sequence header to initial_presentation_delay in ISOBMFF? Looking at the definitions, although one is per frame and the other is per sample, I cannot think of a case where copying the value violates the ISOBMFF wording, because every sample is guaranteed to output a frame.

(If so, at least for some encoders, producing initial_display_delay values is trivial and would make for an easy conformance bitstream)

@cconcolato added the editorial label Jan 23, 2023
@cconcolato

We should revisit this issue once we have conformance streams exercising the feature.

@cconcolato

We intend to close this issue when conformance files are provided (#180)
