20.4.5.6.8 Long Video Understanding, Long-Form Video Understanding

Video Understanding. Long Form Video Understanding.

Pang, B.[Bo], Peng, G.[Gao], Li, Y.Z.[Yi-Zhuo], Lu, C.[Cewu],
Markov Progressive Framework, a Universal Paradigm for Modeling Long Videos,
PAMI(46), No. 12, December 2024, pp. 9749-9765.
IEEE DOI 2411
Videos, Computational modeling, Semantics, Training, Transformers, Task analysis, Solid modeling, Video understanding, progressive modeling BibRef

You, Z.[Zeng], Wen, Z.Q.[Zhi-Quan], Chen, Y.F.[Yao-Fo], Li, X.[Xin], Zeng, R.H.[Run-Hao], Wang, Y.W.[Yao-Wei], Tan, M.K.[Ming-Kui],
Toward Long Video Understanding via Fine-Detailed Video Story Generation,
CirSysVideo(35), No. 5, May 2025, pp. 4592-4607.
IEEE DOI 2505
Visualization, Termination of employment, Semantics, Large language models, Feature extraction BibRef


Liu, S.M.[Shu-Ming], Zhao, C.[Chen], Xu, T.Q.[Tian-Qi], Ghanem, B.[Bernard],
BOLT: Boost Large Vision-Language Model Without Training for Long-Form Video Understanding,
CVPR25(3318-3327)
IEEE DOI Code:
WWW Link. 2508
Training, Visualization, Laplace equations, Accuracy, Focusing, Fasteners, Benchmark testing, Noise measurement, Videos, frame selection BibRef

Jang, H.[Huiwon], Yu, S.[Sihyun], Shin, J.[Jinwoo], Abbeel, P.[Pieter], Seo, Y.[Younggyo],
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction,
CVPR25(22853-22863)
IEEE DOI 2508
Chunks of long videos. Training, Solid modeling, Dynamics, Coherence, Transformers, Tokenization, Encoding, Video codecs, Videos, video tokenization, video generation BibRef

Man, Y.B.[Yuan-Bin], Huang, Y.[Ying], Zhang, C.M.[Cheng-Ming], Li, B.Z.[Bing-Zhe], Niu, W.[Wei], Yin, M.[Miao],
AdaCM2: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction,
CVPR25(8534-8544)
IEEE DOI 2508
Visualization, Adaptation models, Large language models, Memory management, Graphics processing units, Propulsion, multimodal language model BibRef

Ren, W.M.[Wei-Ming], Yang, H.[Huan], Min, J.[Jie], Wei, C.[Cong], Chen, W.[Wenhu],
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by VIdeo SpatioTemporal Augmentation,
CVPR25(3804-3814)
IEEE DOI 2508
Accuracy, Benchmark testing, Performance gain, Robustness, Spatiotemporal phenomena, Spatial resolution, Faces, Videos, synthetic dataset BibRef

Wang, Z.Y.[Zi-Yang], Yu, S.[Shoubin], Stengel-Eskin, E.[Elias], Yoon, J.[Jaehong], Cheng, F.[Feng], Bertasius, G.[Gedas], Bansal, M.[Mohit],
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos,
CVPR25(3272-3282)
IEEE DOI 2508
Training, Accuracy, Refining, Redundancy, Cognition, Data mining, Iterative methods, Feeds, Videos, long video understanding, LLM-based video understanding BibRef

Ye, J.H.[Jin-Hui], Wang, Z.[Zihan], Sun, H.[Haosen], Chandrasegaran, K.[Keshigeyan], Durante, Z.[Zane], Eyzaguirre, C.[Cristobal], Bisk, Y.[Yonatan], Niebles, J.C.[Juan Carlos], Adeli, E.[Ehsan], Fei-Fei, L.[Li], Wu, J.J.[Jia-Jun], Li, M.[Manling],
Re-thinking Temporal Search for Long-Form Video Understanding,
CVPR25(8579-8591)
IEEE DOI 2508
Training, Measurement, Location awareness, Visualization, Computational modeling, Benchmark testing, Search problems, Videos, temporal searching BibRef

Wang, L.[Lan], Chen, Y.J.[Yu-Jia], Tran, D.[Du], Boddeti, V.N.[Vishnu Naresh], Chu, W.S.[Wen-Sheng],
SEAL: SEmantic Attention Learning for Long Video Representation,
CVPR25(26192-26201)
IEEE DOI 2508
Grounding, Computational modeling, Semantics, Redundancy, Seals, Question answering (information retrieval), long video understanding BibRef

Pan, Y.[Yulu], Zhang, C.[Ce], Bertasius, G.[Gedas],
Basket: A Large-Scale Video Dataset for Fine-Grained Skill Estimation,
CVPR25(28952-28962)
IEEE DOI Code:
WWW Link. 2508
Analytical models, Codes, Accuracy, Computational modeling, Estimation, Predictive models, Videos, large-scale video dataset, long video understanding BibRef

Zhou, J.J.[Jun-Jie], Shu, Y.[Yan], Zhao, B.[Bo], Wu, B.[Boya], Liang, Z.Y.[Zheng-Yang], Xiao, S.T.[Shi-Tao], Qin, M.H.[Ming-Hao], Yang, X.[Xi], Xiong, Y.P.[Yong-Ping], Zhang, B.[Bo], Huang, T.J.[Tie-Jun], Liu, Z.[Zheng],
MLVU: Benchmarking Multi-task Long Video Understanding,
CVPR25(13691-13701)
IEEE DOI Code:
WWW Link. 2508
Degradation, Technological innovation, Surveillance, Benchmark testing, Multitasking, Motion pictures, Optimization, Videos BibRef

Shu, Y.[Yan], Liu, Z.[Zheng], Zhang, P.[Peitian], Qin, M.H.[Ming-Hao], Zhou, J.J.[Jun-Jie], Liang, Z.Y.[Zheng-Yang], Huang, T.J.[Tie-Jun], Zhao, B.[Bo],
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding,
CVPR25(26160-26169)
IEEE DOI 2508
Training, Learning systems, Visualization, Soft sensors, Large language models, Graphics processing units, Synthetic data BibRef

Tang, X.[Xi], Qiu, J.[Jihao], Xie, L.X.[Ling-Xi], Tian, Y.J.[Yun-Jie], Jiao, J.B.[Jian-Bin], Ye, Q.X.[Qi-Xiang],
Adaptive Keyframe Sampling for Long Video Understanding,
CVPR25(29118-29128)
IEEE DOI Code:
WWW Link. 2508
Visualization, Adaptation models, Codes, Large language models, Benchmark testing, Feeds, Optimization, Videos, video understanding, keyframe sampling BibRef

Ventura, L.[Lucas], Yang, A.[Antoine], Schmid, C.[Cordelia], Varol, G.[Gül],
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs,
CVPR25(18947-18958)
IEEE DOI 2508
Visualization, Codes, Navigation, Large language models, Computational modeling, Semantics, Speech recognition, Feeds, Videos, vidchapters-7m benchmark BibRef

Geng, T.T.[Tian-Tian], Zhang, J.[Jinrui], Wang, Q.[Qingni], Wang, T.[Teng], Duan, J.M.[Jin-Ming], Zheng, F.[Feng],
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos,
CVPR25(18959-18969)
IEEE DOI Code:
WWW Link. 2508
Filtering, Large language models, Pipelines, Semantics, Manuals, Benchmark testing, Data models, Labeling, Videos, long video understanding BibRef

Kim, J.[Junho], Kim, H.[Hyunjun], Lee, H.[Hosu], Ro, Y.M.[Yong Man],
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis,
CVPR25(3352-3362)
IEEE DOI 2508
Routing, Spatiotemporal phenomena, Videos, Context modeling BibRef

Song, E.[Enxin], Chai, W.H.[Wen-Hao], Wang, G.[Guanhong], Zhang, Y.C.[Yu-Cheng], Zhou, H.Y.[Hao-Yang], Wu, F.[Feiyang], Chi, H.Z.[Hao-Zhe], Guo, X.[Xun], Ye, T.[Tian], Zhang, Y.T.[Yan-Ting], Lu, Y.[Yan], Hwang, J.N.[Jenq-Neng], Wang, G.A.[Gao-Ang],
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding,
CVPR24(18221-18232)
IEEE DOI Code:
WWW Link. 2410
Visualization, Costs, Large language models, Computational modeling, Manuals, Transformers BibRef

Korbar, B.[Bruno], Xian, Y.Q.[Yong-Qin], Tonioni, A.[Alessio], Zisserman, A.[Andrew], Tombari, F.[Federico],
Text-conditioned Resampler For Long Form Video Understanding,
ECCV24(LXXXVI: 271-288).
Springer DOI 2412
BibRef

Wang, X.H.[Xiao-Han], Zhang, Y.H.[Yu-Hui], Zohar, O.[Orr], Yeung-Levy, S.[Serena],
VideoAgent: Long-form Video Understanding with Large Language Model as Agent,
ECCV24(LXXX: 58-76).
Springer DOI 2412
BibRef

Weng, Y.[Yuetian], Han, M.F.[Ming-Fei], He, H.Y.[Hao-Yu], Chang, X.J.[Xiao-Jun], Zhuang, B.[Bohan],
LongVLM: Efficient Long Video Understanding via Large Language Models,
ECCV24(XXXIII: 453-470).
Springer DOI 2412
BibRef

He, B.[Bo], Li, H.[Hengduo], Jang, Y.K.[Young Kyun], Jia, M.L.[Meng-Lin], Cao, X.F.[Xue-Fei], Shah, A.[Ashish], Shrivastava, A.[Abhinav], Lim, S.N.[Ser-Nam],
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding,
CVPR24(13504-13514)
IEEE DOI 2410
Analytical models, Large language models, Memory management, Video sequences, Graphics processing units, Long-Term Video Understanding BibRef

Zhang, C.Y.[Chao-Yi], Lin, K.[Kevin], Yang, Z.Y.[Zheng-Yuan], Wang, J.F.[Jian-Feng], Li, L.J.[Lin-Jie], Lin, C.C.[Chung-Ching], Liu, Z.C.[Zi-Cheng], Wang, L.J.[Li-Juan],
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning,
CVPR24(13647-13657)
IEEE DOI 2410
Measurement, Visualization, Accuracy, Annotations, Memory management, Cognition, video understanding, LLM, in-context learning, multimodal, vision-and-language BibRef

Ren, S.[Shuhuai], Yao, L.[Linli], Li, S.C.[Shi-Cheng], Sun, X.[Xu], Hou, L.[Lu],
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding,
CVPR24(14313-14323)
IEEE DOI Code:
WWW Link. 2410
Location awareness, Visualization, Codes, Grounding, Large language models, Cognition, long video understanding BibRef

Xu, M.[Ming], Gould, S.[Stephen],
Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation,
CVPR24(14618-14627)
IEEE DOI 2410
Video on demand, Costs, Pipelines, Encoding, Web sites, Noise measurement, long-form video understanding, procedural videos BibRef

Rodin, I.[Ivan], Furnari, A.[Antonino], Min, K.[Kyle], Tripathi, S.[Subarna], Farinella, G.M.[Giovanni Maria],
Action Scene Graphs for Long-Form Understanding of Egocentric Videos,
CVPR24(18622-18632)
IEEE DOI Code:
WWW Link. 2410
Codes, Annotations, Manuals, Benchmark testing, Cameras, egocentric vision, scene graphs, long-form video understanding BibRef

Ataallah, K.[Kirolos], Shen, X.Q.[Xiao-Qian], Abdelrahman, E.[Eslam], Sleiman, E.[Essam], Zhuge, M.C.[Ming-Chen], Ding, J.[Jian], Zhu, D.[Deyao], Schmidhuber, J.[Jürgen], Elhoseiny, M.[Mohamed],
Goldfish: Vision-language Understanding of Arbitrarily Long Videos,
ECCV24(XXIX: 251-267).
Springer DOI 2412
BibRef

Afham, M.[Mohamed], Shukla, S.N.[Satya Narayan], Poursaeed, O.[Omid], Zhang, P.[Pengchuan], Shah, A.[Ashish], Lim, S.[Sernam],
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding,
REDLCV23(1181-1186)
IEEE DOI 2401
BibRef

Strafforello, O.[Ombretta], Schutte, K.[Klamer], van Gemert, J.C.[Jan C.],
Are current long-term video understanding datasets long-term?,
CVEU23(2959-2968)
IEEE DOI 2401
BibRef

Yang, X.T.[Xi-Tong], Chu, F.J.[Fu-Jen], Feiszli, M.[Matt], Goyal, R.[Raghav], Torresani, L.[Lorenzo], Tran, D.[Du],
Relational Space-Time Query in Long-Form Videos,
CVPR23(6398-6408)
IEEE DOI 2309
BibRef

Wang, J.[Jue], Zhu, W.T.[Wen-Tao], Wang, P.[Pichao], Yu, X.[Xiang], Liu, L.[Linda], Omar, M.[Mohamed], Hamid, R.[Raffay],
Selective Structured State-Spaces for Long-Form Video Understanding,
CVPR23(6387-6397)
IEEE DOI 2309
BibRef

Islam, M.M.[Md Mohaiminul], Bertasius, G.[Gedas],
Long Movie Clip Classification with State-Space Video Models,
ECCV22(XXXV: 87-104).
Springer DOI 2211
BibRef

Wu, C.Y.[Chao-Yuan], Krähenbühl, P.[Philipp],
Towards Long-Form Video Understanding,
CVPR21(1884-1894)
IEEE DOI 2111
Visualization, Protocols, Computational modeling, Machine vision, Benchmark testing BibRef

Last update: Oct 6, 2025 at 14:07:43