Jointly Learning the Attributes and Composition of Shots for Boundary Detection in Videos

Xuekun Jiang1     Libiao Jin1     Anyi Rao*2     Linning Xu2     Dahua Lin2
*corresponding author



In filmmaking, shots have a profound influence on how movie content is delivered and how it resonates with audiences: different emotions and content can be conveyed through well-designed camera movements or shot editing. Therefore, in pursuit of a high-level understanding of long videos, accurate shot detection in untrimmed videos should be considered the first and most fundamental step. Existing approaches address this problem based on visual differences and content transitions between consecutive frames, while ignoring intrinsic shot attributes, namely camera movements, scales, and viewing angles, which essentially reveal how each shot is created. In this work, we propose a new learning framework (SCTSNet) for shot boundary detection that jointly recognizes the attributes and composition of shots in videos. To facilitate the analysis of shots and the evaluation of shot detection models, we collect a large-scale shot boundary dataset, MovieShots2, which contains 15K shots from 282 movie clips. It is richly annotated with the temporal boundaries between consecutive shots and with individual shot attributes, including camera movements, scales, and viewing angles, which are the three most distinct shot attributes. Our experiments show that the joint learning framework significantly boosts boundary detection performance, surpassing previous results by a large margin: SCTSNet improves shot boundary detection AP from 0.65 to 0.77, an absolute gain of 0.12.
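
As a rough illustration of what jointly learning boundaries and shot attributes could look like, one common multi-task formulation combines a boundary detection loss with one classification loss per attribute. The weighted-sum form, the loss names, and the weights $\lambda_m$, $\lambda_s$, $\lambda_a$ below are assumptions made for illustration only and are not taken from the SCTSNet formulation.

```latex
% Hypothetical joint objective (illustrative; not the paper's stated loss).
% L_boundary : loss for predicting shot boundaries
% L_movement, L_scale, L_angle : classification losses for the three shot attributes
% lambda_m, lambda_s, lambda_a : assumed non-negative weighting hyperparameters
\mathcal{L}_{\text{joint}}
  = \mathcal{L}_{\text{boundary}}
  + \lambda_{m}\,\mathcal{L}_{\text{movement}}
  + \lambda_{s}\,\mathcal{L}_{\text{scale}}
  + \lambda_{a}\,\mathcal{L}_{\text{angle}}
```

Under this sketch, setting all three weights to zero would recover a boundary-only model, which is the kind of baseline the joint framework is compared against.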