
I'm in the process of creating a very basic video player with the ffmpeg libraries and I have all the decoding and re-encoding in place, but I'm stuck on audio video synchronization.

My problem is that movies have their audio and video streams muxed (interleaved) so that audio and video come in "bursts" (a run of audio packets, followed by a run of video frames), like this, where each packet has its own timestamp:

A A A A A A A A V V V V A A A A A A A V V V V ...

A: decoded and re-encoded audio data chunk
V: decoded and re-encoded video frame

presumably arranged this way to prevent too much audio from being processed without video, and vice versa.
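In libav* terms, this interleaving is simply the order in which av_read_frame() returns packets. A minimal sketch of the demuxing loop (error handling omitted; the stream indices are placeholders for values you'd normally get from av_find_best_stream()):

    #include <libavformat/avformat.h>

    static void demux(const char *filename, int audio_idx, int video_idx)
    {
        AVFormatContext *fmt = NULL;
        if (avformat_open_input(&fmt, filename, NULL, NULL) < 0)
            return;
        avformat_find_stream_info(fmt, NULL);

        AVPacket *pkt = av_packet_alloc();
        /* packets come back already interleaved: A A A A ... V V V V ... */
        while (av_read_frame(fmt, pkt) >= 0) {
            if (pkt->stream_index == audio_idx) {
                /* hand the packet to the audio decoder */
            } else if (pkt->stream_index == video_idx) {
                /* hand the packet to the video decoder */
            }
            av_packet_unref(pkt);
        }
        av_packet_free(&pkt);
        avformat_close_input(&fmt);
    }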

Now I have to decode the "bursts" and send them to the audio/video playing components in a timely fashion, and I am a bit lost in the details.

  1. is there a "standard" strategy/paradigm/pattern to handle this kind of problem?
  2. are there tutorials/documentation/books around explaining the techniques to use?
  3. how far can the muxing go in a well-coded movie?

Because I don't expect anything like this:

AAAAAAAAAAA .... AAAAAAAAAAAAA x10000 VVVVVVVVVVVVVV x1000
audio for the whole clip followed by video

or this:

VVVVVVVVVVVV x1000 AAAAAAAAAAA...AAAAAAAAA x1000    
all video frames followed by the audio

to happen in a well-encoded video (after all, preventing such extremes is what muxing is all about...)

Thanks!

UPDATE: since my description might have been unclear, the issue is not with how the streams are laid out, nor with how to decode them: the whole audio/video demuxing, decoding, rescaling and re-encoding pipeline is in place and working, and each chunk of data has its own timestamp.

My problem is what to do with the decoded data without incurring buffer overruns and underruns and, generally, without clogging my pipeline, so I guess it might be considered a "scheduling" problem.
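What I mean by "scheduling": I suspect I need something like a bounded, blocking packet queue per stream, so the demuxer blocks when a queue fills up (preventing overrun) and each decoder blocks when its queue drains (preventing a busy-looping underrun); the well-known dranger ffmpeg/SDL tutorial builds something similar. A minimal sketch of that structure, where all the names are placeholders rather than ffmpeg API:

    #include <pthread.h>
    #include <libavcodec/avcodec.h>   /* AVPacket */

    #define QUEUE_MAX 64              /* arbitrary cap; tune to taste */

    /* one queue per stream; init lock/conds with pthread_*_init() before use */
    typedef struct PacketQueue {
        AVPacket       *pkts[QUEUE_MAX];
        int             head, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_full, not_empty;
    } PacketQueue;

    /* demuxer side: blocks while full, which throttles av_read_frame();
     * pass an owned packet, e.g. one obtained from av_packet_clone() */
    static void queue_push(PacketQueue *q, AVPacket *pkt)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QUEUE_MAX)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->pkts[(q->head + q->count) % QUEUE_MAX] = pkt;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* decoder side: blocks while empty instead of spinning */
    static AVPacket *queue_pop(PacketQueue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        AVPacket *pkt = q->pkts[q->head];
        q->head = (q->head + 1) % QUEUE_MAX;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return pkt;
    }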

  • Did you solve this problem? I have a similar issue trying to sync multiple streams; I have their timestamps, which are their relative elapsed times. Commented Nov 27, 2018 at 2:38
  • @Teocci I didn't; I just put the problem on hold and will address manual A/V decoding another time, sorry! Commented Nov 28, 2018 at 10:51

2 Answers


I'll elaborate a bit on @szatmary's answer, which I have since rightfully marked as correct, though I failed to recognize it as such at first.

Mind that this is my take on his answer; I'm not implying anything about his intentions, and maybe he meant something completely different...

After a bit of musing, I concluded that

Sync is the job of the container.

could be interpreted as "don't waste too much time on gimmicks to schedule audio and video frames, since the container already presents the data in an easy-to-consume way".

To prove that, I investigated a couple of video streams and found that the audio and video data "bursts" come in a way that allows this approach:

  1. decode and buffer all audio data as it comes, and don't worry too much about how much of it is coming: there isn't enough audio to slow down the video processing or to overflow a "reasonable" buffer, for reasonable values of "reasonable" (which doesn't mean I don't take precautions against overflow);
  2. decode each video frame as it comes, and just wait until it's time to present it (a sketch of this follows below).

This works because:

  • the amount of audio data in a burst is tiny enough not to consume too much memory or CPU time, so it can be decoded between frames, even though it encompasses the period of multiple frames
  • the audio and video "bursts" (of samples and frames, respectively) are nicely ordered, which means that the period covered by a burst of samples pretty much covers the period of the following burst of frames.

Everything else is just a "trivial matter of coding" ;-)
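In code, point 2 of the list above boils down to something like the following sketch, where audio_clock_seconds() and display_frame() are hypothetical stand-ins for whatever your audio output and renderer actually expose:

    #include <unistd.h>
    #include <libavutil/frame.h>
    #include <libavutil/rational.h>

    extern double audio_clock_seconds(void);          /* seconds of audio played so far */
    extern void   display_frame(const AVFrame *frame);

    static void present_when_due(AVFrame *frame, AVRational video_time_base)
    {
        /* frame pts -> seconds on the stream's own clock */
        double pts   = frame->best_effort_timestamp * av_q2d(video_time_base);
        double delay = pts - audio_clock_seconds();   /* audio is the master clock */
        if (delay > 0)
            usleep((useconds_t)(delay * 1e6));        /* crude; real players re-check the clock */
        display_frame(frame);
    }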




Sync is the job of the container. Every frame will be timestamped with a PTS/DTS or duration/CTS.
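To illustrate: the pts is counted in units of the stream's time_base, so converting it to seconds is a one-liner:

    #include <libavutil/frame.h>
    #include <libavutil/rational.h>

    /* illustration: seconds = pts * time_base,
     * e.g. pts 180000 at a 1/90000 time_base -> 2.0 s */
    static double frame_time_seconds(const AVFrame *frame, AVRational time_base)
    {
        return frame->pts * av_q2d(time_base);
    }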

2 Comments

Thank you for your answer, @szatmary; however, I'm not sure we are speaking about the same thing. I already have the timestamp for each frame and audio chunk; my problem is how to set up a pipeline that reads an unspecified number of audio/video packets without clogging or lagging. How much buffering do I need (if any)? Is there a preferred approach? This kind of stuff... [btw I have updated my question to make it clearer]
Also, I would have sent you a private message, but that's not possible on the Stack Exchange network: I checked your CV (congratulations on your career, btw), and the link szatmary.org seems to be broken. Thought you might be interested.
