
I'm in the process of creating a very basic video player with the ffmpeg libraries and I have all the decoding and re-encoding in place, but I'm stuck on audio video synchronization.

My problem is that movies have their audio and video streams muxed (interleaved) so that audio and video come in "bursts" (a run of audio packets, followed by a run of video frames), like this, where each packet has its own timestamp:

A A A A A A A A V V V V A A A A A A A V V V V ...

A: decoded and re-encoded audio data chunk
V: decoded and re-encoded video frame

presumably arranged this way to prevent too much audio from being processed without video, and vice versa.
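In libav* terms, this interleaving is simply the order in which av_read_frame() returns packets. A minimal sketch of the demuxing loop (error handling omitted; the stream indices are placeholders for values you'd normally get from av_find_best_stream()):

    #include <libavformat/avformat.h>

    static void demux(const char *filename, int audio_idx, int video_idx)
    {
        AVFormatContext *fmt = NULL;
        if (avformat_open_input(&fmt, filename, NULL, NULL) < 0)
            return;
        avformat_find_stream_info(fmt, NULL);

        AVPacket *pkt = av_packet_alloc();
        /* packets come back already interleaved: A A A A ... V V V V ... */
        while (av_read_frame(fmt, pkt) >= 0) {
            if (pkt->stream_index == audio_idx) {
                /* hand the packet to the audio decoder */
            } else if (pkt->stream_index == video_idx) {
                /* hand the packet to the video decoder */
            }
            av_packet_unref(pkt);
        }
        av_packet_free(&pkt);
        avformat_close_input(&fmt);
    }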

Now I have to decode the "bursts" and send them to the audio/video playing components in a timely fashion, and I am a bit lost in the details.

  1. is there a "standard" strategy/paradigm/pattern to handle this kind of problem?
  2. are there tutorials/documentation/books around explaining the techniques to use?
  3. how far can the muxing go in a well-coded movie?

Because I don't expect anything like this:

AAAAAAAAAAA .... AAAAAAAAAAAAA x10000 VVVVVVVVVVVVVV x1000
audio for the whole clip followed by video

or this:

VVVVVVVVVVVV x1000 AAAAAAAAAAA...AAAAAAAAA x1000    
all video frames followed by the audio

to happen in a well-encoded video (after all, preventing such extremes is what muxing is all about...)

Thanks!

UPDATE: since my description might have been unclear, the issue is not with how the streams are laid out, nor with how to decode them: the whole audio/video demuxing, decoding, rescaling and re-encoding pipeline is in place and working, and each chunk of data has its own timestamp.

My problem is what to do with the decoded data without incurring buffer overruns and underruns and, generally, without clogging my pipeline, so I guess it might be considered a "scheduling" problem.
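What I mean by "scheduling": I suspect I need something like a bounded, blocking packet queue per stream, so the demuxer blocks when a queue fills up (preventing overrun) and each decoder blocks when its queue drains (preventing a busy-looping underrun); the well-known dranger ffmpeg/SDL tutorial builds something similar. A minimal sketch of that structure, where all the names are placeholders rather than ffmpeg API:

    #include <pthread.h>
    #include <libavcodec/avcodec.h>   /* AVPacket */

    #define QUEUE_MAX 64              /* arbitrary cap; tune to taste */

    /* one queue per stream; init lock/conds with pthread_*_init() before use */
    typedef struct PacketQueue {
        AVPacket       *pkts[QUEUE_MAX];
        int             head, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_full, not_empty;
    } PacketQueue;

    /* demuxer side: blocks while full, which throttles av_read_frame();
     * pass an owned packet, e.g. one obtained from av_packet_clone() */
    static void queue_push(PacketQueue *q, AVPacket *pkt)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QUEUE_MAX)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->pkts[(q->head + q->count) % QUEUE_MAX] = pkt;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* decoder side: blocks while empty instead of spinning */
    static AVPacket *queue_pop(PacketQueue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        AVPacket *pkt = q->pkts[q->head];
        q->head = (q->head + 1) % QUEUE_MAX;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return pkt;
    }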

  • Did you solve this problem? I have a similar issue trying to sync multiple streams; I have their timestamps, which are their relative elapsed times. Commented Nov 27, 2018 at 2:38
  • @Teocci I didn't; I just put the problem on hold and will address manual A/V decoding another time, sorry! Commented Nov 28, 2018 at 10:51

2 Answers


I'll elaborate a bit on @szatmary's answer, which I have since rightfully marked as correct, though I failed to recognize it as such at first.

Mind that this is my take on his answer; I'm not implying anything about his intentions, and maybe he meant something completely different...

After a bit of musing, I concluded that

Sync is the job of the container.

could be interpreted as "don't waste too much time on gimmicks to schedule audio and video frames, since the container already presents the data in an easy-to-consume way".

To prove that, I investigated a couple of video streams and found that the audio and video data "bursts" come in a way that allows this approach:

  1. decode and buffer all audio data as it comes, and don't worry too much about how much of it is coming: there isn't enough audio to slow down the video processing or to overflow a "reasonable" buffer, for reasonable values of "reasonable" (which doesn't mean I don't take precautions against overflow);
  2. decode each video frame as it comes, and just wait until it's time to present it (a sketch of this follows below).

This works because:

  • the amount of audio data in a burst is tiny enough not to consume too much memory or CPU time, so it can be decoded between frames, even though it encompasses the period of multiple frames
  • the audio and video "bursts" (of samples and frames, respectively) are nicely ordered, which means that the period covered by a burst of samples pretty much covers the period of the following burst of frames.

Everything else is just a "trivial matter of coding" ;-)
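In code, point 2 of the list above boils down to something like the following sketch, where audio_clock_seconds() and display_frame() are hypothetical stand-ins for whatever your audio output and renderer actually expose:

    #include <unistd.h>
    #include <libavutil/frame.h>
    #include <libavutil/rational.h>

    extern double audio_clock_seconds(void);          /* seconds of audio played so far */
    extern void   display_frame(const AVFrame *frame);

    static void present_when_due(AVFrame *frame, AVRational video_time_base)
    {
        /* frame pts -> seconds on the stream's own clock */
        double pts   = frame->best_effort_timestamp * av_q2d(video_time_base);
        double delay = pts - audio_clock_seconds();   /* audio is the master clock */
        if (delay > 0)
            usleep((useconds_t)(delay * 1e6));        /* crude; real players re-check the clock */
        display_frame(frame);
    }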




Sync is the job of the container. Every frame will be timestamped with a PTS/DTS or duration/CTS.
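To illustrate: the pts is counted in units of the stream's time_base, so converting it to seconds is a one-liner:

    #include <libavutil/frame.h>
    #include <libavutil/rational.h>

    /* illustration: seconds = pts * time_base,
     * e.g. pts 180000 at a 1/90000 time_base -> 2.0 s */
    static double frame_time_seconds(const AVFrame *frame, AVRational time_base)
    {
        return frame->pts * av_q2d(time_base);
    }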

2 Comments

Thank you for your answer, @szatmary; however, I'm not sure we are speaking about the same thing. I already have the timestamp for each frame and audio chunk; my problem is how to set up a pipeline that reads an unspecified number of audio/video packets without clogging or lagging. How much buffering do I need (if any)? Is there a preferred approach? This kind of stuff... [btw I have updated my question to make it clearer]
Also, I would have sent you a private message, but that's not possible on the Stack Exchange network: I checked your CV (congratulations on your career, btw), and the link szatmary.org seems to be broken. Thought you might be interested.
