Dual-hierarchical knowledge distillation for video captioning

Bibliographic Details
Published in: Pattern Recognition, Vol. 171, p. 112192
Main Authors: Luo, HuiLan, Wan, SiQi, Cai, Xia, Wang, ChanJuan, Chen, HongKun
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.03.2026
ISSN: 0031-3203
Description
Summary:
• Introduce a dual-hierarchical distillation framework for video captioning.
• Propose a caption quality grading method to reduce annotation noise.
• Distill semantic features into a lightweight student model for efficiency.
• Improve captioning accuracy while lowering computational cost.
• Outperform prior methods on MSVD and MSR-VTT benchmarks.

Video captioning remains a challenging task due to the trade-off between accuracy and computational efficiency. We propose CapDistill, a dual-hierarchical distillation framework that transfers semantic knowledge from a powerful teacher model to a lightweight student model. CapDistill captures object-level and action-level semantics from captions and transfers multilevel knowledge, including object features, action features, and word-level predictions, through a hierarchical strategy. To reduce the impact of noisy annotations, we introduce a caption quality grading mechanism that assigns quality-based weights to training captions. Experiments on MSR-VTT and MSVD demonstrate that CapDistill achieves state-of-the-art accuracy while significantly reducing inference cost. Code is available at: https://github.com/ccc000-png/SFTCap.
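The abstract describes a quality-weighted, multilevel distillation objective that matches the student to the teacher at the object, action, and word-prediction levels. Below is a minimal sketch of what such a loss could look like; the function name `capdistill_loss`, the tensor shapes, the MSE/KL loss choices, and the temperature value are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

```python
# Hypothetical sketch of a quality-weighted multilevel distillation loss,
# loosely following the abstract's description. Not the authors' code.
import torch
import torch.nn.functional as F

def capdistill_loss(student_obj, teacher_obj,
                    student_act, teacher_act,
                    student_logits, teacher_logits,
                    caption_quality,          # (B,) weights from a caption quality grading step
                    temperature=2.0):
    """Combine object-level, action-level, and word-level distillation terms,
    each weighted per caption by its quality score."""
    # Object- and action-level feature matching (per-sample MSE).
    obj_loss = F.mse_loss(student_obj, teacher_obj, reduction="none").mean(dim=-1)
    act_loss = F.mse_loss(student_act, teacher_act, reduction="none").mean(dim=-1)

    # Word-level prediction matching: KL between temperature-softened
    # distributions, summed over the vocabulary and averaged over tokens.
    t = temperature
    word_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="none",
    ).sum(dim=-1).mean(dim=-1) * (t * t)

    per_sample = obj_loss + act_loss + word_loss
    return (caption_quality * per_sample).mean()

# Toy usage with random tensors (batch of 4 captions, 10 tokens, vocab 100, dim 256).
B, T, V, D = 4, 10, 100, 256
loss = capdistill_loss(
    torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, T, V), torch.randn(B, T, V),
    caption_quality=torch.tensor([1.0, 0.8, 0.5, 1.0]),
)
print(loss.item())
```

In this sketch, low-quality captions simply contribute less to the gradient; how the paper actually grades captions and balances the three levels is specified in the article itself.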
DOI: 10.1016/j.patcog.2025.112192