Dual-hierarchical knowledge distillation for video captioning
| Published in: | Pattern recognition Vol. 171; p. 112192 |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Ltd, 01.03.2026 |
| ISSN: | 0031-3203 |
| Summary: | • Introduce a dual-hierarchical distillation framework for video captioning. • Propose a caption quality grading method to reduce annotation noise. • Distill semantic features into a lightweight student model for efficiency. • Improve captioning accuracy while lowering computational cost. • Outperform prior methods on MSVD and MSR-VTT benchmarks. Video captioning remains a challenging task due to the tradeoff between accuracy and computational efficiency. We propose CapDistill, a dual hierarchical distillation framework that transfers semantic knowledge from a powerful teacher model to a lightweight student model. CapDistill captures object-level and action-level semantics from captions and transfers multilevel knowledge including object features, action features, and word-level predictions through a hierarchical strategy. To reduce the impact of noisy annotations, we introduce a caption quality grading mechanism that assigns quality-based weights to training captions. Experiments on MSR-VTT and MSVD demonstrate that CapDistill achieves state-of-the-art accuracy while significantly reducing inference cost. Code is available at: https://github.com/ccc000-png/SFTCap. |
|---|---|
| DOI: | 10.1016/j.patcog.2025.112192 |
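
The summary names three kinds of transferred knowledge (object features, action features, and word-level predictions) and a per-caption quality weight, but this record does not give the loss itself. The following is only a minimal sketch of how such a weighted multilevel distillation loss could be combined. It assumes mean-squared error for the feature terms and a temperature-scaled KL divergence for the word-level term, which are common distillation choices and not confirmed as the paper's formulation. All names (`distill_loss`, `quality_weight`, `alphas`, the dictionary keys) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a, b):
    """Mean-squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(teacher, student, quality_weight,
                 temperature=2.0, alphas=(1.0, 1.0, 1.0)):
    """Quality-weighted sum of three distillation terms (assumed form).

    teacher / student: dicts with hypothetical keys "object", "action"
    (feature vectors) and "logits" (word-level scores over a vocabulary).
    quality_weight: assumed scalar weight for this training caption.
    """
    object_term = mse(teacher["object"], student["object"])
    action_term = mse(teacher["action"], student["action"])
    word_term = kl_divergence(
        softmax(teacher["logits"], temperature),
        softmax(student["logits"], temperature),
    )
    a_obj, a_act, a_word = alphas
    return quality_weight * (
        a_obj * object_term + a_act * action_term + a_word * word_term
    )


# Toy usage with made-up numbers: a lower-quality caption contributes less.
teacher_out = {"object": [0.2, 0.8], "action": [0.5, 0.1], "logits": [2.0, 0.5, -1.0]}
student_out = {"object": [0.1, 0.6], "action": [0.4, 0.3], "logits": [1.5, 0.7, -0.5]}
print(distill_loss(teacher_out, student_out, quality_weight=1.0))
print(distill_loss(teacher_out, student_out, quality_weight=0.5))
```

In this sketch the quality weight simply scales the whole per-caption loss, so a caption graded as noisy pulls the student less strongly toward the teacher. The repository linked in the summary is the place to check how the authors actually define the terms and weights.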