Please explain how to compute initial_presentation_delay from the bitstream #101
Comments
I agree that the
That could potentially be done, but this is not the appropriate issue tracker for such a feature request. Please file an issue in the AOM issue tracker: https://bugs.chromium.org/p/aomedia/issues/entry?template=Feature+Request. Adding support to libaom for returning the value in terms of samples (aka Temporal Units) would not help in a remux. I don't think adding it to libaom really solves the problem for muxers.
It's not possible to calculate it given only a Sequence Header OBU. You need the
When |
Technically the encoder should know this value if it fills the corresponding field. If not provided by the encoder, it will be very tricky to fill the value properly without parsing the whole stream, so it will only be known a posteriori. Maybe in practice it will always be |
BTW, I don't see the mention of a default to 10 in the ISOBMFF document. |
If I understand the example in the spec correctly, this value is the same as initial_display_delay_minus_1 but counted in TUs instead of frames? Can we add something along these lines to the document: To compute initial_presentation_delay_minus_one from a stream, one must:
Also, I wonder if it would make sense to extend the AV1 bitstream to add this in a metadata block, or to extend the Sequence Header somehow. This value seems useful even outside of MP4 files; keeping it as MP4-specific information sounds like a workaround for a design gap in the bitstream. |
If that is true, maybe we can just drop this initial_presentation_delay from the MP4 header entirely? |
The value 10 comes from BufferPoolMaxSize (=10)? But why 10 and not 9? I think the decoder can't hold more than 10 frames, so it can't hold more than 10 temporal units, so initial_presentation_delay_minus_one + 1 <= 10. I'm also wondering how initial_presentation_delay_minus_one affects the presentation delay on the presentation timeline. The AV1-in-ISOBMFF spec specifies no ctts box. This means that, in general, Decoding Time == Composition Time: the decoder takes a sample at its Decoding Time and outputs a decoded picture with that Composition Time (== Decoding Time), unless there is compositionToDTSShift > 0, and there is no delay in units of ISOBMFF samples (which here correspond to AV1 samples, i.e. Temporal Units). But initial_presentation_delay_minus_one says there may be a delay in units of ISOBMFF samples. It's really confusing. I'm reading the AV1 spec but I don't get why there is a decoder delay in units of timestamped access units; it's just like the packed bitstream, a popular hack used for B-VOPs-in-AVI for VfW, where there was no delay to output a decoded frame at the decoder side. If there are no composition time offsets, it is strange that initial_presentation_delay_minus_one is present. |
FYI, I just filed a feature request to aomenc, see https://bugs.chromium.org/p/aomedia/issues/detail?id=2150 |
It doesn't matter if an encoder knows it or not when re-muxing AV1.
Why do you think you need the entire bitstream? You need enough TUs (samples) to count 10 frames.
Where is this calculation coming from?
It's from the AV1 spec[1]. It's the value of BufferPoolMaxSize. edit: forgot this link: |
I was including the +1, but I should have been more clear. Sorry. |
I agree. It should be possible to determine the value a posteriori. I had started doing that, see https://github.com/cconcolato/av1_decoder_model. You could input an initial_display_delay and check if the bitstream was valid according to the decoder model. It has not been updated in a while, but if anyone is interested, feel free to suggest updates.
It depends on the number of alt-ref images but maybe.
I don't think this algorithm works. Again, you have to run the decoder model and see what minimum initial_display_delay validates the model. |
As you mentioned, composition offsets are not used so you cannot use compositionToDTSShift > 0.
Sorry for that. There is no composition offset because if you feed TUs to a decoder it will produce as many output frames as input TUs, with the same presentation order as the decoding order. If you assume instantaneous decoding (as usual in ISOBMFF), this means CTS = DTS. The initial_presentation_delay concept is introduced to cope with problems happening in real implementations. When your decoder operates at the decoding speed limit of a level, if you don't wait to fill some reference buffers before starting to display, you may experience smoothness issues. The delay tells you how long your player should wait. If no information is provided, an AV1 decoder should wait for 10 frames to be decoded, but for some bitstreams you may need less than that.
I'm not sure what you mean by "in units of timestamped access unit". There is a delay, mostly because of 'show_frame = 0'. The delay at the elementary stream level is expressed in number of decoded frames. At the ISOBMFF level, it is expressed in number of decoded samples, because a player may not have access to the internals of a decoder to know how many frames were decoded when a TU is passed.
Hope I clarified. |
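To illustrate the frames-vs-samples distinction, here is a minimal sketch (the stream layout below is made up purely for illustration) showing how the same delay looks at the two levels:

```python
# Hypothetical stream (decoding order); each entry is the number of decoded
# frames carried by one ISOBMFF sample / temporal unit:
frames_per_sample = [1, 4, 1, 2, 1, 3, 1, 1]

delay_in_frames = 10          # elementary-stream level (initial_display_delay)
decoded, delay_in_samples = 0, 0
for n in frames_per_sample:   # find how many samples cover those 10 frames
    decoded += n
    delay_in_samples += 1
    if decoded >= delay_in_frames:
        break

print(delay_in_samples)       # 6 samples, even though the delay is 10 frames
```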
Can you please give me some pseudo-code or algorithms to compute it? No theoretical decoders that can't fail please, just a real algorithm that a stupid programmer like myself can implement. The part I don't understand is why counting 10 frames is enough? Does this delay only apply to the first 10 frames? What if there is a bigger grouping later, is that forbidden by the AV1 spec? |
Unfortunately, that has not been done yet ...
That's the upper bound according to the AV1 spec. If you decode 10 frames before presenting the first one, you are guaranteed to be able to present the bitstream smoothly (if the decoder operates at the decoding speed (or faster) given by the level definition) |
Maybe a little background on the need for initial_display_delay would help?
The problem arises due to the concept of hidden frames, being defined as
frames with show_frame = 0 in the frame header.
Think about a (ridiculous!) worst case - unlikely to be useful in practice
- for example:
An encoder produces a first temporal unit (TU0) containing a first
keyframe. It then produces a second temporal unit (TU1) consisting of a
large number of frames, say 101, the first 100 of which are hidden frames
(show_frame = 0), the last one being showable (show_frame = 1). The decoder
first decodes TU0 to produce the keyframe, then it decodes TU1 to produce
101 frames, of which only the last is showable. Now, if the keyframe is displayed
as soon as it is decoded, it is likely that the 2nd displayable frame will
not have been decoded in time for display, because the decoder has to
decode 100 additional frames between the two displayable frames. Thus,
playback will not be smooth. Of course, if your decoder runs 100 times
faster than that required to satisfy the AV1 level-defined sample
throughput criteria then the decoder may be able to keep up, but all we can
assume in general is that the decoder just meets the minimum performance
criteria specified by the signaled level.
In practice hidden frames are used more conservatively, to implement a
pyramid coding structure for example, and the resulting GOP structure might
need a maximum of 4-5 hidden frames in a single TU. In these cases we can
compute a minimum time period that the display of the first frame should be
delayed to ensure smooth playback for the entire stream,
initial_display_delay, which we express in terms of the number of frames
that are required to be decoded before display of the first frame.
One might think that this delay could be at most 9 frames, being the 8
reference buffer slots, plus a buffer to hold the frame currently being
decoded. However, we routinely use GOPs that require a 10 frame delay. As
seen in the ridiculous example above, the delay can "theoretically" extend
beyond 10, but this was deemed to be a sensible compromise.
Hope this helps.
Regards,
Adrian
|
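To make the stall in that worst-case example concrete, here is a toy simulation. It is not the normative AV1 decoder model; the function name, the constant decode rate, and the assumption of one shown frame per TU displayed per tick are simplifications made up for illustration:

```python
from itertools import accumulate

def playback_is_smooth(frames_per_tu, startup_delay_frames, frames_per_tick=1.0):
    """Toy check: is every shown frame decoded before its display time?

    Assumes each TU carries exactly one shown frame, shown frames are displayed
    one per tick, the decoder decodes a constant frames_per_tick, and display
    starts only once startup_delay_frames frames have been decoded.
    """
    start = startup_delay_frames / frames_per_tick
    for tick, frames_needed in enumerate(accumulate(frames_per_tu)):
        if frames_needed / frames_per_tick > start + tick:
            return False  # the shown frame of this TU is not ready in time
    return True

# The example above: TU0 = 1 keyframe, TU1 = 100 hidden frames + 1 shown frame.
print(playback_is_smooth([1, 101], startup_delay_frames=1))    # False: stalls on TU1
print(playback_is_smooth([1, 101], startup_delay_frames=102))  # True, but only with a
# 102-frame start-up delay, which a real 10-buffer AV1 decoder could not provide anyway.
```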
That all makes sense. The part that I don't get is why looking at the first 10 frames of a stream is enough? Isn't it possible to have 10 frames with show_frame=1, then the next 8 with show_frame=0, or 1000 shown frames before the first hidden one? So to compute the value of the initial delay (i.e., the size of the dpb), we'd need to parse every frame in the stream, since the value in the sequence header is not useful for muxing. But Tom says that parsing the first 10 frames is enough? |
A muxer has to either trust the encoder to give the value or run the analysis on the entire stream. |
I was referring to Temporal Units or samples, not frames, since samples can contain multiple frames. What I said was that to calculate the value, a (de)muxer could count the frames in the samples that begin the stream. Since 10 is:
A (de)muxer can calculate a value in samples by counting frames in Temporal Units. When the value is not present in the Sequence Header OBU, a (de)muxer would use the value 10. Whether it is present in the Sequence Header OBU or not, a muxer would count the frames in Temporal Units until it processes the TU where the frame count reaches that value. (edit: added '+ 1' to the first bullet point) |
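Assuming the muxer can already tell how many decoded frames each sample (TU) contains (for example, from its own OBU parsing), the counting approach described in this comment might look like the sketch below. The names and signature are illustrative, not taken from any existing library:

```python
BUFFER_POOL_MAX_SIZE = 10  # frame delay assumed when the Sequence Header signals nothing

def compute_initial_presentation_delay_minus_one(frames_per_sample,
                                                 initial_display_delay=None):
    """Sketch of the counting approach described above.

    frames_per_sample:     decoded-frame count of each sample (TU), decoding order.
    initial_display_delay: initial_display_delay_minus_1 + 1 from the Sequence
                           Header OBU if present, otherwise None.
    Returns the number of samples to decode before the first presentation, minus one.
    """
    target = initial_display_delay if initial_display_delay is not None else BUFFER_POOL_MAX_SIZE
    decoded = 0
    for index, frames in enumerate(frames_per_sample):
        decoded += frames
        if decoded >= target:
            return index  # (index + 1) samples are needed, minus one
    return max(len(frames_per_sample) - 1, 0)  # short stream: delay spans all samples
```

Note the caveat raised earlier in the thread: simple counting like this only mirrors what is described here, and a fully validated value may still require running the decoder model.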
As @agrange noted, it could theoretically extend beyond 10. I still think this information can be provided by the encoder: a posteriori in all cases, and a priori if it knows exactly how it manages non-showable frames. It's true that partially remuxing a file may change the value needed to have smooth playback, if fewer frames are needed for that part (it can never be more). Does it make the remuxed file invalid if it claims a value for initial_presentation_delay_minus_one but it's actually less (and the goal of that value is to find out when it's less than 10)? Or should we either void the value/presence on remux, or reparse all OBUs to get the proper value? |
Olivier:
Determining the value of initial_display_delay may need to be based on the
analysis of more than the first 10 frames, see my example. In practice, the
encoder is responsible for putting the correct number in the sequence
header, and would either know or calculate based on the GOP structure it
uses. (Note: The spec uses 4 bits to signal initial_display_delay_minus_1
so we can signal values up to 16).
If a middlebox remuxes the stream - to extract different operating points
for example - then the encoder needs to ensure that the value specified is
the worst case for all the operating points.
Steve:
initial_display_delay is only a (strong) recommendation to the application
that if it starts displaying frames too early then it may encounter
problems later. Values that are bigger than the optimal value just mean
that the application may be over-cautious and introduce an unnecessary
delay in startup time. Specifying a value that is too small would run the
risk of disrupted playback. The stream would still be valid either way.
Adrian
|
I don't think that's what @agrange said. AV1 limits the number of buffers to 10, so the 11th frame could not be decoded. @agrange can you clarify?
I agree.
Yes, but a posteriori it could also be another tool (muxer or else) but this requires running the decoder model.
No. However, you have to be careful that if you splice two streams that have different values, you either have to use the larger one or not give the value. |
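Expressed as a tiny sketch (the function name and the use of None for "value not signalled" are illustrative assumptions, not from the spec):

```python
def spliced_presentation_delay(delay_a, delay_b):
    """Delay for a stream spliced from two inputs: keep the larger signalled
    value, or omit the field entirely if one of the inputs did not signal it."""
    if delay_a is None or delay_b is None:
        return None  # omit the value in the spliced output
    return max(delay_a, delay_b)
```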
I interpreted Steve's comment to mean that, as per the example I provided,
we can conceive of a GOP structure that requires more than 10 frame
buffers. Whilst AV1 restricts signaling of the delay to 4 bits, so 16
frames, a decoder that only provides the minimal 10 frame buffers that AV1
mandates would not be able to decode the 11th frame until one of the 10
available frame buffers becomes free, presumably as the result of a display
event.
A decoder / application may choose to provide a larger number of frame
buffers, 100 say, which would allow it to decode way beyond the 10th frame
before displaying any frame. But a compliant bitstream cannot rely on that
over-provisioning. And we are still able to signal a maximum delay of 16
frames.
|
@cconcolato I still don't get it from your explanation. First, clarify the definition of composition time in AV1-in-ISOBMFF. The absence of the Composition Time to Sample Box does not mean the absence of the definition and/or concept of composition time as applied to AV1-in-ISOBMFF. From your explanation, I can see there is no concept of composition time. Personally, I really dislike this indication of the delay outside the common time structures in the scope of ISOBMFF. Why not just add a Composition Time to Sample Box consisting of only one entry which indicates the presentation delay as the sample_offset instead? I'm also wondering how to handle the case where an edit list is applied. media_time=0 specifies that the presentation of the AV1 track starts from time=0 on the media timeline, but is the presentation then delayed until the time of the (initial_presentation_delay_minus_one+1)-th AV1 sample? |
I don't get this part at all. I can understand there is a delay at the frame level, but I can't understand why there is a delay at the TU level. As far as I understand, a TU is a grouping of frames delimited by a timestamp which can be assigned to an output frame. This means the decoder takes a TU with timestamp T and can then output a shown frame with T without waiting for the next TU. The AV1 spec says "Each temporal unit must have exactly one shown frame.", so I strongly think that when the decoder takes a TU it can output a frame smoothly. Where am I wrong? Or, to output the first frame after TU0, could that frame depend on TU1 or a later TU? |
Almost. I would say delimited by a temporal delimiter in the input bitstream, but they are associated with the same timestamp.
Only one of the frames in the TU will produce an output frame.
Yes.
If you take out the word "smoothly", yes. Given a TU, a decoder can always output a frame.
No. A TU has no dependency on future TUs. |
All that we're saying here is:
- the decoder has to wait for N = initial_display_delay (defined in the
AV1 bitstream spec) frames to be decoded before displaying the first
decoded frame
- those N frames may be contained in K (as set in the ISOBMFF container)
temporal units
- The process that wraps the bitstream in the container has to work out K
|
I'm not sure what the question is here.
We could have used the
The |
I think I understand the nuance now. A compliant decoder should only cache at most 10 frames. So even if a TU contains 100 frames, the decoder will still have at most 10 frames in its cache. So it can never be more than 10 (minus/plus 1 depending on how you count). |
A minimally compliant decoder that just about achieves all the minimum
requirements of the level that it advertises is guaranteed to be able to
decode a valid stream if it caches 10 frames. If it caches fewer frames,
then it is not. A decoder may cache more than 10 frames - if it is capable
of decoding faster than required and wants to run ahead of schedule, for
example - but it doesn't have to.
|
It is really unfortunate that this discussion spawned around/after v1 of the specification got "frozen", probably partially because people only start implementing something once they know it will not wildly change any more. But it is what it is. This whole value seems like something that should be a header flag in the AV1 bit stream. Also, one would think that things like this could be handled on the SW side of the hwdec implementation, where it would just return "feed me more" until it can actually return the following coded image in PTS/CTS order. I think this is how hwdec generally works for AVC/HEVC? Given, of course, that the required header/initialization values are available to the actual decoder/parser that feeds the hwdec implementation.
So do I understand it correctly that writing this value at all is 100% voluntary and that there is a boolean somewhere to mention if you could come up with a value for this field or not? Or do you mean writing the default (10 buffered frames) as "omit"? |
Yes |
That is just a composition time offset at the TU where the decoder requires more TUs, isn't it?
I don't think this approach makes the issue simpler. If you really want to avoid negative CTS offsets, it is enough that the spec forbids them. Also, I don't think the edit list is a complex thing. You say it is a hint for players, but if players ignore it even when initial_presentation_delay is present, there is possibly jerkiness, isn't there? So I think it is not an ignorable thing, but a hint that matters for almost all players. This is similar to the case where the demuxer or player doesn't know about the edit list and may introduce A/V desync. The initial_presentation_delay only makes sense for AV1. To support AV1 in ISOBMFF, the muxer and demuxer need to support initial_presentation_delay in addition to the decoder initialization record as the minimum implementation. I believe that container file formats should hide and abstract away codec-specific properties as much as possible, so that any encapsulated codec can be treated in a common way. If initial_presentation_delay were defined in the ISOBMFF spec itself, I wouldn't be saying such unpleasant words. :( |
The issue here is that a TU (a Sample in ISOBMFF / a Block in Matroska) may contain more than one frame to decode, invisible to the container. But it's not necessarily at the beginning of the stream. The Sequence Header OBU may have the information, but as frames, not TUs. And it doesn't say how many frames may be packed in a TU over the whole Sequence it describes. So in any case, we cannot get this information from the decoder. |
For context, a related issue in dav1d: https://code.videolan.org/videolan/dav1d/-/issues/406 |
Is it always safe to copy initial_display_delay from the sequence header to initial_presentation_delay in ISOBMFF? Looking at the definitions, although one is per frame and one is per sample, I cannot think of a case where copying the value violates the ISOBMFF wording, because every sample is guaranteed to output a frame. (If so, at least for some encoders, producing initial_display_delay values is trivial and would make for an easy conformance bitstream) |
We should revisit this issue once we have conformance streams exercising the feature. |
We intend to close this issue when conformance files are provided (#180) |
The current definition of initial_presentation_delay is "interesting" if you want to build a decoder, but it's completely non-trivial to figure out what value to put there when writing a muxer. When trying to produce it in GStreamer, I'm completely lost. I noticed that neither ffmpeg nor libaom has any code that implements it.
Can we please, either: