Query-based video summarization with multi-label classification network



Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 82, Issue 24, pp. 37529–37549
Authors: Hu, Weifeng; Zhang, Yu; Li, Yujun; Zhao, Jia; Hu, Xifeng; Cui, Yan; Wang, Xuejing
Format: Journal Article
Language: English
Published: New York: Springer US, 01.10.2023
Springer Nature B.V.
ISSN:1380-7501, 1573-7721
Online access: Full text
Description
Abstract: Generic video summarization algorithms produce a single, fixed summary for a given video, and therefore cannot satisfy the differing summary requirements that different users have for the same video. This paper addresses the task of query-based video summarization, which takes a user's query and a long video as input and aims to generate a summary tailored to that query. We propose a query-based video summarization algorithm built on a multi-label classification network (MLC-SUM). Specifically, we treat video summarization as a target-based multi-label classification problem: we predict the correlation between video content and multi-concept labels by feeding convolutional features into a multi-layer perceptron, and then use the cross-correlation among the labels to weight the predicted probabilities. Finally, we select the video content most relevant to the user's query sentence as the output summary. Experiments on three common datasets verify the effectiveness and superiority of the proposed algorithm.
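The pipeline the abstract describes can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption, not the paper's implementation: the dimensions, the randomly initialized MLP weights, the placeholder label-correlation matrix, and the top-k selection are all hypothetical stand-ins for the trained MLC-SUM components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 8 video shots, each with a
# 16-dim convolutional feature vector; a vocabulary of 5 concept labels.
n_shots, feat_dim, n_labels = 8, 16, 5

def mlp_predict(features, w1, b1, w2, b2):
    """One-hidden-layer perceptron producing per-shot label probabilities."""
    h = np.maximum(features @ w1 + b1, 0.0)       # ReLU hidden layer
    logits = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))          # sigmoid: multi-label output

# Randomly initialized weights stand in for a trained network.
w1 = rng.normal(size=(feat_dim, 32)); b1 = np.zeros(32)
w2 = rng.normal(size=(32, n_labels)); b2 = np.zeros(n_labels)

shot_features = rng.normal(size=(n_shots, feat_dim))
probs = mlp_predict(shot_features, w1, b1, w2, b2)  # shape (n_shots, n_labels)

# Label cross-correlation matrix; here identity plus small off-diagonal noise
# as a placeholder, where the paper uses correlations among the labels.
corr = np.eye(n_labels) + 0.1 * rng.random((n_labels, n_labels))
weighted = probs @ corr                             # correlation-weighted scores

# Model the query as a binary vector over concept labels, score each shot by
# its weighted relevance to the queried concepts, and keep the top-k shots.
query = np.array([1, 0, 0, 1, 0], dtype=float)      # user asks for labels 0 and 3
relevance = weighted @ query                        # one score per shot
summary_idx = np.argsort(relevance)[::-1][:3]       # 3 most relevant shots
print(sorted(summary_idx.tolist()))
```

The sketch only shows the data flow (features → label probabilities → correlation weighting → query-relevance ranking); shot segmentation, training, and the actual query encoding are outside its scope.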
DOI:10.1007/s11042-023-15126-1