I have an iOS/macOS tennis app that now lets the user import video, and I would like to add the ability to automatically edit out the significant amount of downtime where players are not in a rally or point. To start, I considered filtering the video down to segments where tennis ball trajectories are detected, but I don't think it's possible to get the associated video times for the trajectory occurrences.
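For reference, here is a rough sketch of how I imagine running trajectory detection over an imported video. The AVAssetReader walk and the idea of recording each frame's presentation timestamp myself are my own assumptions, and the confidence threshold and trajectory length are guesses I haven't tuned:

```swift
import AVFoundation
import Vision

// Sketch only: read the video's frames, run VNDetectTrajectoriesRequest on each
// sample buffer, and record the presentation timestamps of frames that produced
// a confident trajectory observation.
func trajectoryTimestamps(for asset: AVAsset) throws -> [CMTime] {
    var hitTimes: [CMTime] = []
    var currentTime = CMTime.zero

    // The completion handler is invoked during perform(_:) with any trajectories
    // found in the frame currently being analyzed.
    let request = VNDetectTrajectoriesRequest(frameAnalysisSpacing: .zero,
                                              trajectoryLength: 6) { request, _ in
        guard let results = request.results as? [VNTrajectoryObservation],
              results.contains(where: { $0.confidence > 0.9 }) else { return }
        hitTimes.append(currentTime)
    }

    guard let track = asset.tracks(withMediaType: .video).first else { return [] }
    let reader = try AVAssetReader(asset: asset)
    let output = AVAssetReaderTrackOutput(track: track, outputSettings: [
        kCVPixelBufferPixelFormatTypeKey as String: kCVPixelFormatType_420YpCbCr8BiPlanarFullRange
    ])
    reader.add(output)
    guard reader.startReading() else { return [] }

    while let sampleBuffer = output.copyNextSampleBuffer() {
        // Remember this frame's time so the completion handler can tag detections with it.
        currentTime = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
        let handler = VNImageRequestHandler(cmSampleBuffer: sampleBuffer, options: [:])
        try handler.perform([request])
    }
    return hitTimes
}
```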
If I'm making two splits for each segment of downtime, I could use pose detection (which works on video as well as photos) to find all the serve motions (how each point starts) or feeds (how each practice rally starts) and use those as the split that ends each downtime segment.
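For that serve/feed side, I was picturing a per-frame check along these lines with VNDetectHumanBodyPoseRequest. The "wrist above the nose" heuristic for an overhead motion and the confidence thresholds are just my assumptions, not something I've validated on real footage:

```swift
import Vision

// Sketch only: flag a frame as a possible serve/overhead motion when the detected
// player's wrist is higher in the frame than their nose.
func looksLikeServe(in pixelBuffer: CVPixelBuffer) throws -> Bool {
    let request = VNDetectHumanBodyPoseRequest()
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try handler.perform([request])

    guard let body = request.results?.first as? VNHumanBodyPoseObservation else { return false }

    // Vision points are normalized with the origin at the bottom-left,
    // so a larger y value means higher in the frame.
    let wrist = try body.recognizedPoint(.rightWrist)
    let nose = try body.recognizedPoint(.nose)
    guard wrist.confidence > 0.3, nose.confidence > 0.3 else { return false }

    return wrist.location.y > nose.location.y
}
```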
For the starting split that marks the beginning of each downtime segment, though, I'm not sure what the simplest approach would be to detect that a point or rally has ended. Detecting hand signals for "out" wouldn't cover many of the ways a point can end, and neither would audio analysis for verbal out calls (and certainly not for practice sessions). Any guidance on a relatively simple yet comprehensive approach here (likely using a machine learning framework) would be greatly appreciated.
The expected video comes from everyday tennis players filmed from behind the baseline against the fence, so their own court would be the dominant part of the frame.