NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/42230
Full metadata record (DC field: value [language]):
dc.contributor.advisor: 謝宏昀 (Hung-Yun Hsieh)
dc.contributor.author: Hsiao-Pu Lin [en]
dc.contributor.author: 林孝蒲 [zh_TW]
dc.date.accessioned: 2021-06-15T00:54:13Z
dc.date.available: 2008-09-02
dc.date.copyright: 2008-09-02
dc.date.issued: 2008
dc.date.submitted: 2008-08-07
dc.identifier.citation:
[1] H.-Y. Hsieh, C.-W. Li, S.-W. Liao, Y.-W. Chen, T.-L. Tsai, and H.-P. Lin, “Moving
toward end-to-end support for handoffs across heterogeneous telephony systems
on dual-mode mobile devices,” Elsevier Computer Communications, Special
Issue on End-to-End Support over Heterogeneous Wired-Wireless Networks, article
in press, April 2007.
[2] S. E. Poltrock and J. Grudin, “Videoconferencing: Recent experiments and
reassessment,” in Proceedings of the 38th Annual Hawaii International Conference
on System Sciences (HICSS ’05), pp. 3–6, January 2005.
[3] C.-W. Lin, Y.-C. Chen, and M.-T. Sun, “Dynamic region of interest transcoding
for multipoint video conferencing,” IEEE Transactions on Circuits and Systems
for Video Technology, pp. 982–992, October 2003.
[4] I. E. Akkus, M. R. Civanlar, and O. Ozkasap, “Peer-to-peer multipoint video conferencing
using layered video,” Signal Processing and Communications Applications, pp.
1–4, April 2006.
[5] VIBO Telecom Inc., “Taiwan 3G - VIBO Telecom.”
[6] R. G. Cutler and A. L. Bridgewater, “Audio/video synchronization using audio
hashing,” U.S. Patent Application 20060291478, December 2006. Online Available
at: http://www.freepatentsonline.com/20060291478.html
[7] Octave Communications, “Octave communications.” Online Available at:
http://www.octave.in/menu.htm
[8] S. E. Poltrock and J. Grudin, “Videoconferencing: Recent experiments and reassessment,”
IEEE, 2005.
[9] J. Huang, W.-C. Feng, J. Walpole, and W. Jouve, “An experimental analysis of
DCT-based approaches for fine-grain multi-resolution video,” Multimedia Systems,
pp. 513–531, January 2006.
[10] K.-T. Fung, Y.-L. Chan, and W.-C. Siu, “Low-complexity and high-quality
frame-skipping transcoder for continuous presence multipoint video conferencing,”
IEEE Transactions on Multimedia, no. 1, pp. 31–46, February 2004.
[11] C.-W. Lin, Y.-C. Chen, and M.-T. Sun, “Dynamic region of interest transcoding
for multipoint video conferencing,” IEEE Transactions on Circuits and Systems
for Video Technology, no. 10, pp. 982–992, October 2003.
[12] M. Chen, G.-M. Su, and M. Wu, “Robust distributed multi-point video conferencing
over error-prone channels,” 2006 IEEE International Conference on
Multimedia and Expo, pp. 1149–1152, July 2006.
[13] M. R. Civanlar, O. Ozkasap, and T. Celebi, “Peer-to-peer multipoint videoconferencing
on the internet,” Signal Processing: Image Communication, pp. 743–754,
May 2005.
[14] I. E. Akkus, M. R. Civanlar, and O. Ozkasap, “Peer-to-peer multipoint video
conferencing using layered video,” in Signal Processing and Communications
Applications, 2006 IEEE 14th, pp. 1–4, 2006.
[15] MpegTV, “Mpeg.org.” Online Available at: http://www.mpeg.org/
[16] C. Liu, Y. Xie, and M. J. Lee, “Multipoint multimedia teleconference system
with adaptive synchronization,” IEEE JSAC, pp. 1422–1435, September 1996.
[17] Y. Xie, C. Liu, M. J. Lee, and Y. N. Saadawi, “Adaptive multimedia synchronization
in a teleconference system,” ACM/Springer Multimedia Systems, no. 4,
pp. 326–337, 1999.
[18] H. Liu and M. E. Zarki, “A synchronization control scheme for real-time streaming
multimedia applications,” in Proceedings of 13th Packet Video Workshop,
April 2003.
[19] ——, “On the adaptive delay and synchronization control for video conferencing
over the internet,” in Proceedings of ICN 2004, March 2004.
[20] ——, “An adaptive delay and synchronization control scheme for wi-fi based
audio/video conferencing,” Springer Science + Business Media, pp. 511–522,
May 2006.
[21] C.-C. Kuo, M.-S. Chen, and J.-C. Chen, “An adaptive transmission scheme for
audio and video synchronization based on real-time transport protocol,” IEEE
International Conference on Multimedia and Expo, pp. 525–528, 2001.
[22] C. Kim, K.-D. Seo, W. Sung, and S.-H. Jung, “Efficient audio/video synchronization
method for video telephony system in consumer cellular phones,”
ICCE ’06 Consumer Electronics, pp. 137–138, January 2006.
[23] M. Yang, N. Bourbakis, Z. Chen, and M. Trifas, “An efficient audio-video synchronization
methodology,” IEEE International Conference on Multimedia and
Expo, pp. 767–770, July 2007.
[24] WIKIPEDIA, “Lip sync.” Online Available at: http://en.wikipedia.org/wiki/Lip_sync
[25] G. Zoric and I. S. Pandzic, “A real-time lip sync system using a genetic algorithm
for automatic neural network configuration,” IEEE International Conference on
Multimedia and Expo, pp. 1366–1369, July 2005.
[26] ——, “Automatic lip sync and its use in the new multimedia services for mobile
devices,” in Proceedings of the 8th International Conference on Telecommunications,
2005, pp. 353–358, June 2005.
[27] W.-N. Lie and H.-C. Hsieh, “Lips detection by morphological image processing,”
in Proceedings of ICSP ’98, pp. 1084–1087, 1998.
[28] D. F. McAllister, R. D. Rodman, D. L. Bitzer, and A. S. Freeman, “Lip synchronization
of speech,” 1998.
[29] GoldWave Inc., “Audio editing software- GoldWave.” Online Available at:
http://www.goldwave.com/
[30] D. L. Mills, “Network time protocol (version 3) specification, implementation,
and analysis,” RFC 1305, 1992.
[31] S.-M. Jun, D.-H. Yu, Y.-H. Kim, and S.-Y. Seong, “A time synchronization
method for NTP,” in RTCSA ’99: Proceedings of the Sixth International Conference
on Real-Time Computing Systems and Applications. Washington, DC,
USA: IEEE Computer Society, 1999, pp. 466–473.
[32] R. Steinmetz, “Human perception for jitter and media synchronization,” IEEE
Journal on Selected Areas in Comm., no. 1, pp. 61–72, January 1996.
[33] H. Manhaiem, R. Silon, G. Fartuk, and S. Refael, Packet Loss Concealment
Techniques and Algorithms.
[34] The MathWorks, Inc, “The MathWorks- MATLAB and Simulink for technical
computing.” Online Available at: http://www.mathworks.com/
[35] H. Boril and P. Pollak, “Direct time domain fundamental frequency estimation
of speech in noisy conditions,” European Signal Processing Conference 2004, pp.
1003–1006, September 2004.
[36] T. Shimamura and H. Kobayashi, “Weighted autocorrelation for pitch extraction
of noisy speech,” IEEE Transactions on Speech and Audio Processing, pp.
727–730, October 2001.
[37] B. V. K. Kumar, A. Mahalanobis, and R. D. Juday, Correlation Pattern Recognition.
Cambridge University Press, 2005.
VoiceAge Corporation, “Open AMR initiative - AMR codec.” Online Available at:
www.voiceage.com
[39] YUVSoft Corp. and Graphics and Media Lab, “Ultimate compression resources
catalog.” Online Available at: http://www.compression-links.info/
[40] DSPRelated.com, “MATLAB code- Mel Frequency Cepstral Coefficients.”
Online Available at: http://dsprelated.com/
[41] A. M. Reddy and B. Raj, “Soft mask methods for single-channel speaker separation,”
IEEE Trans. Audio, Speech, Lang. Process., no. 6, pp. 1766–1776, August
2007.
[42] P. Smaragdis, “Convolutive speech bases and their application to supervised
speech separation,” IEEE Trans. Audio, Speech, Lang. Process., no. 1, pp. 1–12,
January 2007.
[43] B. Raj and P. Smaragdis, “Latent variable decomposition of spectrograms for
single channel speaker separation,” IEEE Workshop on Applications of Signal
Process. to Audio and Acoustics, October 2005.
[44] D. Chazan, Y. Stettiner, and D. Malah, “Optimal multi-pitch estimation using
the EM algorithm for co-channel speech separation,” IEEE Trans. Audio, Speech,
Lang. Process., 1993.
[45] D. P. Morgan, E. B. George, L. T. Lee, and S. M. Kay, “Cochannel speaker
separation by harmonic enhancement and suppression,” IEEE Trans. Audio,
Speech, Lang. Process., no. 5, pp. 407–424, September 1997.
[46] D.-L. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles,
Algorithms, and Applications. Hoboken, New Jersey: John Wiley and
Sons, 2006.
[47] B. C. J. Moore, An Introduction to the Psychology of Hearing, 5th ed. San
Diego, CA: Academic Press, 2003.
[48] A. Jourjine, S. Rickard, and O. Yilmaz, “Blind separation of disjoint orthogonal
signals: Demixing N sources from 2 mixtures,” in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing, pp. 2985–2988,
June 2000.
[49] S. T. Roweis, “Factorial models and refiltering for speech separation and denoising,”
In Proceedings of Eurospeech, pp. 1009–1012, September 2003.
[50] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency
masking,” IEEE Transactions on Signal Processing, no. 7, pp. 1830–1847,
July 2004.
[51] Z. Shan, J. Swary, and S. Aviyente, “Underdetermined source separation in the
time-frequency domain,” ICASSP, pp. 945–948, 2007.
[52] A. Aissa-El-Bey, K. Abed-Meraim, and Y. Grenier, “Underdetermined blind separation
of audio sources from the time-frequency representation of their convolutive
mixtures,” ICASSP, pp. 153–156, September 2007.
[53] L. T. Nguyen, A. Belouchrani, K. Abed-Meraim, and B. Boashash, “Separating
more sources than sensors using time-frequency distributions,” ISSPA, pp. 583–
586, August 2001.
[54] S. Rickard, R. Balan, and J. Rosca, “Real-time time-frequency based
blind source separation,” pp. 651–656, May 2001. Online Available at:
http://citeseer.ist.psu.edu/rickard01realtime.html
[55] P. Bofill and M. Zibulevsky, “Blind separation of more sources than mixtures
using sparsity of their short-time Fourier transform,” in Proc. Int. Workshop
Independent Component Anal. Blind Source Separation, Helsinki, Finland, pp.
87–92, June 2000.
[56] S. Rickard and O. Yilmaz, “On the approximate W-Disjoint Orthogonality of
speech,” ICASSP, pp. 13–17, May 2002.
[57] M. Aoki, M. Okamoto, S. Aoki, H. Matsui, T. Sakurai, and Y. Kaneda, “Sound
source segregation based on estimating incident angle of each frequency component
of input signals acquired by multiple microphones,” Acoust. Sci. Technol.,
no. 2, pp. 149–157, February 2001.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/42230
dc.description.abstract: As multi-functional telephony devices grow in popularity, a traditional audio conference may now involve heterogeneous teleconferencing devices, including POTS phones, dual-mode smartphones, pocket PCs, and so on. Some of these devices can access IP networks and support video conferencing with peer devices in the audio conference, giving a better conferencing experience. In this scenario, it becomes necessary to synchronize the audio streams, which traverse the PSTN, with the video streams, which traverse the IP network. Related work has investigated audio/video synchronization, but only within a homogeneous network, so it cannot be applied to the target scenario.
In this thesis we propose an end-to-end framework for audio/video synchronization and simplify the problem to one that requires only synchronization between PSTN and IP audio streams. We first employ a time-domain algorithm based on cross correlation and show that it is ineffective at synchronizing audio streams distorted by noise or packet loss. Hence, we extract distortion-tolerant audio features with digital speech processing (DSP) techniques. Applying MFCC in the synchronization algorithm yields respectable performance for audio streams distorted by the voice codec and by packet loss, but MFCC is inherently vulnerable to overlapping speakers. We therefore leverage the sparsity of speech in spectrograms to design a spectrogram-based synchronization algorithm that achieves favorable performance for speech mixtures and noisy speech. Evaluation results show that DSP techniques improve both the accuracy and the robustness of synchronization across PSTN audio streams and IP video streams. [en]
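To make the cross-correlation step concrete, here is a minimal Python sketch of time-domain lag estimation, assuming NumPy and SciPy are available; the function name `estimate_lag` and all parameter values are illustrative assumptions, not the thesis implementation:

```python
# Minimal sketch of time-domain synchronization by cross correlation.
# Assumptions: both streams share a sampling rate, and the true offset
# lies within +/- max_lag_s seconds.
import numpy as np
from scipy import signal

def estimate_lag(reference, delayed, rate_hz=8000, max_lag_s=1.0):
    """Estimate how many seconds `delayed` lags `reference`."""
    corr = signal.correlate(delayed, reference, mode="full", method="fft")
    lags = signal.correlation_lags(len(delayed), len(reference), mode="full")
    # Search only a plausible lag range, mimicking a finite matching window.
    mask = np.abs(lags) <= int(max_lag_s * rate_hz)
    return lags[mask][np.argmax(corr[mask])] / rate_hz

# Toy usage: a noise signal standing in for speech, delayed by 250 ms.
rng = np.random.default_rng(0)
x = rng.standard_normal(8000 * 3)          # 3 s of "audio" at 8 kHz
y = np.concatenate([np.zeros(2000), x])    # delayed copy (2000 samples)
print(estimate_lag(x, y))                  # ~0.25
```

A peak at a positive lag means the second stream trails the first by that many samples; as the abstract notes, this raw-sample matching degrades once the two streams are distorted differently by noise or packet loss.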
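The MFCC-based scheme can be sketched in the same spirit: compare frame-level cepstral features, which tolerate codec distortion better than raw samples, across candidate shifts. This sketch assumes `librosa` for MFCC extraction; the hop size, shift range, and cosine-similarity scoring are assumed stand-ins for the thesis's similarity metric:

```python
# Sketch of feature-domain synchronization: slide MFCC frame sequences
# against each other and pick the shift with the highest mean cosine
# similarity. Illustrative parameters; not the thesis implementation.
import numpy as np
import librosa

def mfcc_lag_seconds(reference, delayed, rate_hz=8000, hop=80, max_shift=100):
    ref = librosa.feature.mfcc(y=reference, sr=rate_hz, n_mfcc=13, hop_length=hop)
    dly = librosa.feature.mfcc(y=delayed, sr=rate_hz, n_mfcc=13, hop_length=hop)

    def score(shift):  # mean cosine similarity with `delayed` shifted back
        n = min(ref.shape[1], dly.shape[1] - shift)
        if n <= 0:
            return -np.inf
        a, b = ref[:, :n], dly[:, shift:shift + n]
        num = np.sum(a * b, axis=0)
        den = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-9
        return float(np.mean(num / den))

    best = max(range(max_shift), key=score)   # non-negative lags only, for brevity
    return best * hop / rate_hz               # frames -> seconds
```

The resolution here is one hop (10 ms at these settings), which is why misalignment of analysis windows appears as a separate evaluation axis in the table of contents below.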
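Finally, the spectrogram-based idea can be illustrated by keeping only each stream's most significant time-frequency cells and aligning the resulting binary masks; because concurrent speakers tend to occupy largely disjoint cells, an interfering speaker corrupts fewer matched cells than in the MFCC case. Again a hedged sketch, with the top-5% significance threshold and the overlap score standing in for the thesis's significance-determination step:

```python
# Sketch of spectrogram-based synchronization: threshold each stream's
# STFT magnitude to its strongest cells, then find the frame shift whose
# binary masks overlap most. Parameters are illustrative assumptions.
import numpy as np
from scipy import signal

def significant_mask(x, rate_hz=8000, nperseg=256, keep=0.05):
    _, _, Z = signal.stft(x, fs=rate_hz, nperseg=nperseg)
    mag = np.abs(Z)
    return mag >= np.quantile(mag, 1.0 - keep)   # keep the top 5% of cells

def mask_lag_frames(ref_mask, dly_mask, max_shift=50):
    def overlap(shift):
        n = min(ref_mask.shape[1], dly_mask.shape[1] - shift)
        if n <= 0:
            return -1.0
        hits = np.logical_and(ref_mask[:, :n], dly_mask[:, shift:shift + n])
        return hits.sum() / max(dly_mask[:, shift:shift + n].sum(), 1)
    return max(range(max_shift), key=overlap)    # shift in STFT frames
```

With nperseg=256 and SciPy's default 50% segment overlap, each mask frame advances 128 samples (16 ms at 8 kHz), so the recovered frame shift converts to seconds the same way as in the MFCC sketch.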
dc.description.provenance: Made available in DSpace on 2021-06-15T00:54:13Z (GMT). No. of bitstreams: 1. ntu-97-R95942032-1.pdf: 1618719 bytes, checksum: d0073e8ae636cb3aeaa8aa2c1ba3282b (MD5). Previous issue date: 2008 [en]
dc.description.tableofcontents:
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
2.1 Heterogeneous Teleconferencing Scenario
2.1.1 Audio Conference Architecture
2.1.2 Video Conference Architecture
2.2 Related Work
2.2.1 Conventional IP Audio-Video Synchronization
2.2.2 Lip Synchronization
CHAPTER 3 A FRAMEWORK FOR AUDIO-VIDEO SYNCHRONIZATION
3.1 Synchronization Framework
3.1.1 Concept of Simplification
3.1.2 Proposed Framework
3.2 Asynchronism Measurement
3.2.1 Experiment Setup
3.2.2 Measurement Results
3.3 Challenges of Audio Synchronization
3.3.1 Distortion by Voice Codec
3.3.2 Distortion by Noise
3.3.3 Interference by Other Conferees
3.3.4 Packet Loss in Wireless Connection
3.3.5 Reactiveness to Network Dynamics
CHAPTER 4 SYNCHRONIZATION BASED ON CROSS CORRELATION
4.1 Time-domain Audio Features
4.2 Cross-Correlation-Based Synchronization
4.2.1 Basics of Cross Correlation
4.2.2 Cross-Correlation Synchronization Module
4.3 Design Issues
4.3.1 Matching Window Size
4.3.2 Search Step Size
4.3.3 Short Conclusion
4.4 Performance Evaluation
4.4.1 Codec Distortion
4.4.2 Noise Distortion
4.4.3 Overlapping Speakers
4.4.4 Packet Loss
4.4.5 Short Conclusion on Performance
CHAPTER 5 SYNCHRONIZATION BASED ON MFCC
5.1 MFCC-Based Synchronization
5.1.1 Basics of MFCC
5.2 Synchronization Algorithm Design
5.2.1 Mathematical Analysis
5.2.2 Similarity Metric
5.3 Performance Evaluation
5.3.1 Codec Distortion
5.3.2 Misalignment of Analysis Windows
5.3.3 Noise Distortion
5.3.4 Overlapping Speakers
5.3.5 Packet Loss
5.3.6 Short Conclusion on Performance
CHAPTER 6 SYNCHRONIZATION BASED ON SPECTROGRAM
6.1 Spectrogram-Based Synchronization
6.1.1 Sparsity on Spectrogram
6.2 Synchronization Algorithm Design
6.2.1 Significance Determination
6.2.2 Synchronization Module
6.3 Performance Evaluation
6.3.1 Codec Distortion
6.3.2 Misalignment of Analysis Windows
6.3.3 Noise Distortion
6.3.4 Overlapping Speakers
6.3.5 Packet Loss
6.3.6 Short Conclusion on Performance
CHAPTER 7 CONCLUSIONS AND FUTURE WORK
7.1 Performance Comparison
7.1.1 Codec Distortion
7.1.2 Noise Distortion
7.1.3 Overlapping Speakers
7.1.4 Packet Loss
7.2 Conclusions
REFERENCES
dc.language.iso: en
dc.title: 以數位語音處理技術解決異質視訊會議之同步問題 [zh_TW]
dc.title: On Using Digital Speech Processing Techniques for Synchronization among Heterogeneous Teleconferencing Devices [en]
dc.type: Thesis
dc.date.schoolyear: 96-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 葉丙成 (Ping-Cheng Yeh), 周俊廷 (Chun-Ting Chou), 鄭振牟 (Chen-Mou Cheng), 高榮鴻 (Rung-Hung Gau)
dc.subject.keyword: 數位語音處理技術, 語音影像同步, 異質網路, 視訊會議 [zh_TW]
dc.subject.keyword: Digital Speech Processing, Audio/Video Synchronization, Heterogeneous Network, Teleconferencing [en]
dc.relation.page: 92
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2008-08-07
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Appears in collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in this item:
File: ntu-97-1.pdf (currently not authorized for public access)
Size: 1.58 MB
Format: Adobe PDF