Zotkin, D.N.[Dmitry N.],
Duraiswami, R.[Ramani],
Davis, L.S.[Larry S.],
Joint Audio-Visual Tracking Using Particle Filters,
JASP(2002), No. 11, November 2002, pp. 1154.
WWW Link.
0304
BibRef
Garg, A.[Ashutosh],
Pavlovic, V.[Vladimir],
Rehg, J.M.[James M.],
Boosted learning in dynamic Bayesian networks for multimodal speaker
detection,
PIEEE(91), No. 9, September 2003, pp. 1355-1369.
IEEE DOI
0309
BibRef
Earlier:
Audio-visual speaker detection using dynamic Bayesian networks,
AFGR00(384-390).
IEEE DOI
0003
BibRef
Pavlovic, V.[Vladimir],
Garg, A.[Ashutosh],
Rehg, J.M.[James M.],
Huang, T.S.[Thomas S.],
Multimodal Speaker Detection using Error Feedback Dynamic Bayesian
Networks,
CVPR00(II: 34-41).
IEEE DOI
0005
BibRef
Pavlovic, V.,
Berry, G., and
Huang, T.S.,
Integration of Audio/Visual Information for Use in
Human-Computer Intelligent Interaction,
ICIP97(I: 121-124).
IEEE DOI
BibRef
9700
Choudhury, T.[Tanzeem],
Rehg, J.M.,
Pavlovic, V.,
Pentland, A.P.,
Boosting and structure learning in dynamic Bayesian networks for
audio-visual speaker detection,
ICPR02(III: 789-794).
IEEE DOI
0211
BibRef
Pavlovic, V.[Vladimir],
Multimodal tracking and classification of audio-visual features,
ICIP98(I: 343-347).
IEEE DOI
9810
BibRef
Rehg, J.M.[James M.],
Murphy, K.P.[Kevin P.],
Fieguth, P.W.[Paul W.],
Vision-Based Speaker Detection Using Bayesian Networks,
CVPR99(II: 110-116).
IEEE DOI More particularly, the one talking.
BibRef
9900
Vajaria, H.[Himanshu],
Sankar, R.[Ravi],
Kasturi, R.[Ranga],
Exploring Co-Occurence Between Speech and Body Movement for
Audio-Guided Video Localization,
CirSysVideo(18), No. 11, November 2008, pp. 1608-1617.
IEEE DOI
0811
BibRef
Vajaria, H.[Himanshu],
Islam, T.[Tanmoy],
Sarkar, S.[Sudeep],
Sankar, R.[Ravi],
Kasturi, R.[Ranga],
Audio Segmentation and Speaker Localization in Meeting Videos,
ICPR06(II: 1150-1153).
IEEE DOI
0609
BibRef
Talantzis, F.,
Pnevmatikakis, A.,
Constantinides, A.G.,
Audio-Visual Active Speaker Tracking in Cluttered Indoors Environments,
SMC-B(39), No. 1, February 2009, pp. 7-15.
IEEE DOI
0902
BibRef
Earlier:
SMC-B(38), No. 3, June 2008, pp. 799-807.
IEEE DOI
0711
The top entry is the special issue; the paper was published early in the other issue.
BibRef
Lee, J.S.[Jong-Seok],
de Simone, F.[Francesca],
Ebrahimi, T.[Touradj],
Efficient video coding based on audio-visual focus of attention,
JVCIR(22), No. 8, November 2011, pp. 704-711.
Elsevier DOI
1110
Video coding; Audio-visual focus of attention; Quality of experience;
Audio-visual source localization; H.264/AVC; Flexible macroblock
ordering (FMO); Canonical correlation analysis; Subjective quality
assessment
BibRef
Blauth, D.A.[Dante A.],
Minotto, V.P.[Vicente P.],
Jung, C.R.[Claudio R.],
Lee, B.[Bowon],
Kalker, T.[Ton],
Voice activity detection and speaker localization using audiovisual
cues,
PRL(33), No. 4, March 2012, pp. 373-380.
Elsevier DOI
1201
User interfaces; Voice activity detection; Speaker localization;
Multimodal analysis; Hidden Markov Models
BibRef
Montazzolli, S.,
Jung, C.R.,
Gelb, D.[Dan],
Audiovisual voice activity detection using off-the-shelf cameras,
ICIP15(3886-3890)
IEEE DOI
1512
Lip Movement
BibRef
Minotto, V.P.[V. Peruffo],
Jung, C.R.[C. Rosito],
Lee, B.[Bowon],
Simultaneous-Speaker Voice Activity Detection and Localization Using
Mid-Fusion of SVM and HMMs,
MultMed(16), No. 4, June 2014, pp. 1032-1044.
IEEE DOI
1407
Accuracy
BibRef
Qian, X.,
Brutti, A.,
Lanz, O.,
Omologo, M.,
Cavallaro, A.,
Multi-Speaker Tracking From an Audio-Visual Sensing Device,
MultMed(21), No. 10, October 2019, pp. 2576-2588.
IEEE DOI
1910
image colour analysis, object detection, object tracking,
particle filtering (numerical methods), sensor fusion,
particle filter
BibRef
Pu, J.,
Panagakis, Y.,
Pantic, M.,
Active Speaker Detection and Localization in Videos Using Low-Rank
and Kernelized Sparsity,
SPLetters(27), 2020, pp. 865-869.
IEEE DOI
2006
Sparse matrices, Kernel, Visualization, Matrix decomposition, Videos,
Correlation, Spectrogram, Active speaker localization,
kernels
BibRef
Qian, X.Y.[Xin-Yuan],
Liu, Q.[Qi],
Wang, J.D.[Jia-Dong],
Li, H.Z.[Hai-Zhou],
Three-Dimensional Speaker Localization: Audio-Refined Visual Scaling
Factor Estimation,
SPLetters(28), 2021, pp. 1405-1409.
IEEE DOI
2108
Location awareness, Visualization,
Cameras, Microphone arrays, Estimation, Adaptive arrays,
dynamic sensor weighting
BibRef
Ban, Y.T.[Yu-Tong],
Alameda-Pineda, X.[Xavier],
Girin, L.[Laurent],
Horaud, R.[Radu],
Variational Bayesian Inference for Audio-Visual Tracking of Multiple
Speakers,
PAMI(43), No. 5, May 2021, pp. 1761-1776.
IEEE DOI
2104
BibRef
Earlier: A1, A3, A2, A4:
Exploiting the Complementarity of Audio and Visual Data in
Multi-speaker Tracking,
CVAVM17(446-454)
IEEE DOI
1802
Visualization, Target tracking, Acoustics, Bayes methods, Cameras,
Object tracking, Direction-of-arrival estimation,
speaker diarization.
Cameras, Detectors, Kalman filters, Microphones, Robots, Tracking,
Visualization
BibRef
Qian, X.Y.[Xin-Yuan],
Brutti, A.[Alessio],
Lanz, O.[Oswald],
Omologo, M.[Maurizio],
Cavallaro, A.[Andrea],
Audio-Visual Tracking of Concurrent Speakers,
MultMed(24), 2022, pp. 942-954.
IEEE DOI
2202
Target tracking, Acoustics, Faces, Cameras, Visualization,
Image color analysis, 3D multiple target tracking,
particle filter
BibRef
Hu, D.[Di],
Wei, Y.[Yake],
Qian, R.[Rui],
Lin, W.Y.[Wei-Yao],
Song, R.H.[Rui-Hua],
Wen, J.R.[Ji-Rong],
Class-Aware Sounding Objects Localization via Audiovisual
Correspondence,
PAMI(44), No. 12, December 2022, pp. 9844-9859.
IEEE DOI
2212
Where did the sound come from?
Location awareness, Visualization, Task analysis, Annotations,
Semantics, Dictionaries, Videos, distribution alignment
BibRef
Zheng, A.[Aihua],
Hu, M.[Menglan],
Jiang, B.[Bo],
Huang, Y.[Yan],
Yan, Y.[Yan],
Luo, B.[Bin],
Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching,
MultMed(24), 2022, pp. 338-351.
IEEE DOI
2202
Visualization, Task analysis, Measurement, Speech recognition,
Videos, Location awareness, Image recognition, metric learning
BibRef
Wang, Y.[Yusen],
Qian, X.H.[Xiao-Hong],
Zhou, W.[Wujie],
Transformer-Prompted Network: Efficient Audio-Visual Segmentation via
Transformer and Prompt Learning,
SPLetters(32), 2025, pp. 516-520.
IEEE DOI
2501
Transformers, Feature extraction, Frequency-domain analysis,
Europe, Visualization, Location awareness, Convolution,
self-knowledge distillation
BibRef
Wang, H.[Hao],
Zha, Z.J.[Zheng-Jun],
Li, L.[Liang],
Chen, X.J.[Xue-Jin],
Luo, J.B.[Jie-Bo],
Semantic and Relation Modulation for Audio-Visual Event Localization,
PAMI(45), No. 6, June 2023, pp. 7711-7725.
IEEE DOI
2305
Visualization, Location awareness, Correlation, Proposals, Semantics,
Task analysis, Modulation, Audio-visual learning, normalization
BibRef
Garg, R.[Rishabh],
Gao, R.H.[Ruo-Han],
Grauman, K.[Kristen],
Visually-Guided Audio Spatialization in Video with Geometry-Aware
Multi-task Learning,
IJCV(131), No. 10, October 2023, pp. 2723-2737.
Springer DOI
2309
BibRef
Wang, J.X.[Jia-Xiang],
Li, C.L.[Cheng-Long],
Zheng, A.[Aihua],
Tang, J.[Jin],
Luo, B.[Bin],
Looking and Hearing Into Details:
Dual-Enhanced Siamese Adversarial Network for Audio-Visual Matching,
MultMed(25), 2023, pp. 7505-7516.
IEEE DOI
2311
BibRef
Liu, C.[Chen],
Li, P.[Peike],
Zhang, H.[Hu],
Li, L.C.[Lin-Cheng],
Huang, Z.[Zi],
Wang, D.D.[Da-Dong],
Yu, X.[Xin],
BAVS: Bootstrapping Audio-Visual Segmentation by Integrating
Foundation Knowledge,
MultMed(26), 2024, pp. 10015-10028.
IEEE DOI
2410
Visualization, Semantics, Location awareness, Background noise,
Task analysis, White noise, Transformers,
and audio-visual hierarchical trees
BibRef
Liu, C.[Chen],
Li, P.[Peike],
Yang, L.Y.[Li-Ying],
Wang, D.D.[Da-Dong],
Li, L.C.[Lin-Cheng],
Yu, X.[Xin],
Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent
Alignment,
CVPR25(28922-28931)
IEEE DOI
2508
Visualization, Uncertainty, Attention mechanisms, Accuracy, Merging,
Estimation, Contrastive learning, Reliability,
audio visual segmentation
BibRef
Liu, C.[Chen],
Li, P.P.[Peike Patrick],
Yu, Q.[Qingtao],
Sheng, H.W.[Hong-Wei],
Wang, D.D.[Da-Dong],
Li, L.C.[Lin-Cheng],
Yu, X.[Xin],
Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos,
CVPR24(22712-22722)
IEEE DOI Code:
WWW Link.
2410
Location awareness, Visualization, Adaptation models, Annotations,
Grounding, Deformation, Multimodal Processing,
sounding source localization
BibRef
Traa, J.,
Smaragdis, P.,
A Wrapped Kalman Filter for Azimuthal Speaker Tracking,
SPLetters(20), No. 12, 2013, pp. 1257-1260.
IEEE DOI
1311
Approximation methods
BibRef
Li, Y.[Yidi],
Liu, H.[Hong],
Yang, B.[Bing],
STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking,
MultMed(27), 2025, pp. 1835-1847.
IEEE DOI
2504
Visualization, Feature extraction, Acoustics,
Acoustic measurements, Target tracking, Location awareness,
cross-modal attention
BibRef
Shi, Z.F.[Zhao-Feng],
Wu, Q.B.[Qing-Bo],
Meng, F.M.[Fan-Man],
Xu, L.F.[Lin-Feng],
Li, H.L.[Hong-Liang],
Cross-Modal Cognitive Consensus Guided Audio-Visual Segmentation,
MultMed(27), 2025, pp. 209-223.
IEEE DOI
2501
Visualization, Semantics, Feature extraction, Object segmentation,
Location awareness, Data mining, Feeds, Attention mechanisms,
semantic-level consistency
BibRef
Senocak, A.[Arda],
Ryu, H.[Hyeonggon],
Kim, J.[Junsik],
Oh, T.H.[Tae-Hyun],
Pfister, H.[Hanspeter],
Chung, J.S.[Joon Son],
Toward Interactive Sound Source Localization:
Better Align Sight and Sound!,
PAMI(47), No. 9, September 2025, pp. 7643-7659.
IEEE DOI
2508
Location awareness, Benchmark testing, Visualization, Measurement,
Semantics, Contrastive learning, Cross modal retrieval,
cross-modal retrieval
BibRef
Kim, I.H.[In-Ho],
Song, Y.[Youngkil],
Park, J.[Jicheol],
Kim, W.H.[Won Hwa],
Kwak, S.[Suha],
Improving Sound Source Localization with Joint Slot Attention on
Image and Audio,
CVPR25(3121-3130)
IEEE DOI
2508
Location awareness, Computational modeling, Noise, Contrastive learning,
Benchmark testing, Vectors, Standards, Cross modal retrieval
BibRef
Liu, C.[Chen],
Yang, L.Y.[Li-Ying],
Li, P.[Peike],
Wang, D.D.[Da-Dong],
Li, L.[Lincheng],
Yu, X.[Xin],
Dynamic Derivation and Elimination: Audio Visual Segmentation with
Enhanced Audio Semantics,
CVPR25(3131-3141)
IEEE DOI
2508
Representation learning, Visualization, Matched filters, Codes,
Semantics, Object segmentation, Benchmark testing,
audio visual localization
BibRef
Ryu, H.[Hyeonggon],
Kim, S.[Seongyu],
Chung, J.S.[Joon Son],
Senocak, A.[Arda],
Seeing Speech and Sound: Distinguishing and Locating Audio Sources in
Visual Scenes,
CVPR25(13540-13549)
IEEE DOI
2508
Visualization, Accuracy, Grounding, Computational modeling,
Complexity theory, Standards, Cross modal retrieval
BibRef
Wang, X.Z.[Xi-Zi],
Cheng, F.[Feng],
Bertasius, G.[Gedas],
LoCoNet: Long-Short Context Network for Active Speaker Detection,
CVPR24(18462-18472)
IEEE DOI Code:
WWW Link.
2410
Convolutional codes, Visualization, Benchmark testing, Robustness,
Convolutional neural networks
BibRef
Huang, C.[Chao],
Tian, Y.P.[Ya-Peng],
Kumar, A.[Anurag],
Xu, C.L.[Chen-Liang],
Egocentric Audio-Visual Object Localization,
CVPR23(22910-22921)
IEEE DOI
2309
BibRef
Nugroho, M.A.[Muhammad Adi],
Woo, S.[Sangmin],
Lee, S.[Sumin],
Kim, C.[Changick],
Audio-Visual Glance Network for Efficient Video Recognition,
ICCV23(10116-10125)
IEEE DOI
2401
BibRef
Liu, Y.[Yang],
Tan, Y.[Ying],
Lan, H.Y.[Hao-Yuan],
Self-Supervised Contrastive Learning for Audio-Visual Action
Recognition,
ICIP23(1000-1004)
IEEE DOI
2312
BibRef
Mo, S.T.[Shen-Tong],
Morgado, P.[Pedro],
Localizing Visual Sounds the Easy Way,
ECCV22(XXXVII:218-234).
Springer DOI
2211
BibRef
Xia, Y.[Yan],
Zhao, Z.[Zhou],
Cross-modal Background Suppression for Audio-Visual Event
Localization,
CVPR22(19957-19966)
IEEE DOI
2210
Location awareness, Visualization, Codes, Logic gates,
Feature extraction, Robustness, Action and event recognition,
Vision + X
BibRef
Jiang, H.[Hao],
Murdock, C.[Calvin],
Ithapu, V.K.[Vamsi Krishna],
Egocentric Deep Multi-Channel Audio-Visual Active Speaker
Localization,
CVPR22(10534-10542)
IEEE DOI
2210
Location awareness, Voice activity detection, Visualization,
Machine vision, Lighting, Real-time systems, Microphone arrays,
Vision applications and systems
BibRef
Min, K.[Kyle],
Roy, S.[Sourya],
Tripathi, S.[Subarna],
Guha, T.[Tanaya],
Majumdar, S.[Somdeb],
Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection,
ECCV22(XXXV:371-387).
Springer DOI
2211
BibRef
Duan, B.[Bin],
Tang, H.[Hao],
Wang, W.[Wei],
Zong, Z.L.[Zi-Liang],
Yang, G.W.[Guo-Wei],
Yan, Y.[Yan],
Audio-Visual Event Localization via Recursive Fusion by Joint
Co-Attention,
WACV21(4012-4021)
IEEE DOI
2106
Location awareness, Visualization, Fuses,
Task analysis
BibRef
Wu, Y.[Yu],
Zhu, L.C.[Lin-Chao],
Yan, Y.[Yan],
Yang, Y.[Yi],
Dual Attention Matching for Audio-Visual Event Localization,
ICCV19(6291-6299)
IEEE DOI
2004
feature extraction, image fusion,
video signal processing,
Video sequences
BibRef
Majumder, S.[Sagnik],
Al-Halah, Z.[Ziad],
Grauman, K.[Kristen],
Move2Hear: Active Audio-Visual Source Separation,
ICCV21(275-285)
IEEE DOI
2203
Solid modeling, Source separation, Robot vision systems,
Reinforcement learning, Ear, Vision + other modalities,
Vision for robotics and autonomous vehicles
BibRef
Majumder, S.[Sagnik],
Grauman, K.[Kristen],
Active Audio-Visual Separation of Dynamic Sound Sources,
ECCV22(XXIX:551-569).
Springer DOI
2211
BibRef
Alcázar, J.L.[Juan León],
Heilbron, F.C.[Fabian Caba],
Thabet, A.K.[Ali K.],
Ghanem, B.[Bernard],
MAAS: Multi-modal Assignation for Active Speaker Detection,
ICCV21(265-274)
IEEE DOI
2203
Visualization, Benchmark testing, Feature extraction,
Data structures, Task analysis, Vision + other modalities,
Video analysis and understanding
BibRef
Köpüklü, O.[Okan],
Taseska, M.[Maja],
Rigoll, G.[Gerhard],
How to Design a Three-Stage Architecture for Audio-Visual Active
Speaker Detection in the Wild,
ICCV21(1173-1183)
IEEE DOI
2203
Codes, Computational modeling, Pipelines,
Encoding, Task analysis, Vision + other modalities,
Vision applications and systems
BibRef
Wu, Y.[Yu],
Yang, Y.[Yi],
Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual
Video Parsing,
CVPR21(1326-1335)
IEEE DOI
2111
Training, Visualization, Target tracking,
Annotations, Predictive models
BibRef
Liu, H.[Hong],
Sun, Y.H.[Yong-Heng],
Li, Y.D.[Yi-Di],
Yang, B.[Bing],
3D Audio-Visual Speaker Tracking with A Novel Particle Filter,
ICPR21(7343-7348)
IEEE DOI
2105
BibRef
Earlier: A1, A3, A4, Only:
3D Audio-Visual Speaker Tracking with A Two-Layer Particle Filter,
ICIP19(1955-1959)
IEEE DOI
1910
Visualization, Histograms, Head,
Image color analysis, Sensor phenomena and characterization, compact platform.
3D speaker tracking, audio-visual fusion, particle filter, adaptive likelihood
BibRef
He, G.,
Liu, X.,
Fan, F.,
You, J.,
Image2Audio: Facilitating Semi-supervised Audio Emotion Recognition
with Facial Expression Image,
VL3W20(3978-3983)
IEEE DOI
2008
Spectrogram, Training, Emotion recognition,
Reliability, Visualization, Face recognition
BibRef
Le, N.[Nam],
Heili, A.[Alexandre],
Wu, D.[Di],
Odobez, J.M.[Jean-Marc],
Temporally subsampled detection for accurate and efficient face
tracking and diarization,
ICPR16(1792-1797)
IEEE DOI
1705
Detectors, Face, Face detection, Image color analysis,
Motion pictures, TV, Tracking
BibRef
Saeed, A.[Anwar],
Al-Hamadi, A.[Ayoub],
Heuer, M.[Michael],
Speaker Tracking Using Multi-modal Fusion Framework,
ICISP12(539-546).
Springer DOI
1208
BibRef
Kelly, D.[Damien],
Pitie, F.[Francois],
Kokaram, A.[Anil],
Boland, F.[Frank],
A Comparative Error Analysis of Audio-Visual Source Localization,
M2SFA208(xx-yy).
0810
BibRef
Katsarakis, N.[Nikos],
Talantzis, F.[Fotios],
Pnevmatikakis, A.[Aristodemos],
Polymenakos, L.[Lazaros],
The AIT 3D Audio / Visual Person Tracker for CLEAR 2007,
MTPH07(xx-yy).
Springer DOI
0705
See also AIT 2D Face Detection and Tracking System for CLEAR 2007, The.
See also AIT Multimodal Person Identification System for CLEAR 2007, The.
BibRef
Kushal, A.[Akash],
Rahurkar, M.[Mandar],
Fei-Fei, L.[Li],
Ponce, J.[Jean],
Huang, T.[Thomas],
Audio-Visual Speaker Localization Using Graphical Models,
ICPR06(I: 291-294).
IEEE DOI
0609
BibRef
Tsuji, T.[Tokuo],
Yamamoto, K.[Kenkichi],
Ishii, I.[Idaku],
Real-time Sound Source Localization Based on Audiovisual Frequency
Integration,
ICPR06(IV: 322-325).
IEEE DOI
0609
BibRef
Megherbi, N.,
Ambellouis, S.,
Colot, O.,
Cabestaing, F.,
Data Association in Multi-Target Tracking Using Belief Theory:
Handling Target Emergence and Disappearance Issue,
AVSBS05(517-521).
IEEE DOI
0602
BibRef
Megherbi, N.,
Ambellouis, S.,
Colot, O.,
Cabestaing, F.,
Joint audio-video people tracking using belief theory,
AVSBS05(135-140).
IEEE DOI
0602
BibRef
Li, X.[Xin],
Sun, L.[Luo],
Tao, L.M.[Lin-Mi],
Xu, G.Y.[Guang-You],
Jia, Y.[Ying],
A Speaker Tracking Algorithm Based on Audio and Visual Information
Fusion Using Particle Filter,
ICIAR04(II: 572-580).
Springer DOI
0409
BibRef
Lange, C.[Christian],
Hermann, T.[Thomas],
Ritter, H.[Helge],
Holistic Body Tracking for Gestural Interfaces,
GW03(132-139).
Springer DOI
0405
BibRef
Blake, A.,
Gangnet, M.,
Perez, P.,
Vermaak, J.,
Integrated tracking with vision and sound,
CIAP01(354-357).
IEEE DOI
0210
BibRef
Chapter on Face Recognition, Human Pose, Detection, Tracking, Gesture Recognition, Fingerprints, Biometrics continues in
Mouth Location, Lip Location, Detection.