NTU Theses and Dissertations Repository
Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/42230
Full metadata record (DC field: value [language]):
dc.contributor.advisor: 謝宏昀 (Hung-Yun Hsieh)
dc.contributor.author: Hsiao-Pu Lin [en]
dc.contributor.author: 林孝蒲 [zh_TW]
dc.date.accessioned: 2021-06-15T00:54:13Z
dc.date.available: 2008-09-02
dc.date.copyright: 2008-09-02
dc.date.issued: 2008
dc.date.submitted: 2008-08-07
dc.identifier.citation:
[1] H.-Y. Hsieh, C.-W. Li, S.-W. Liao, Y.-W. Chen, T.-L. Tsai, and H.-P. Lin, “Moving
toward end-to-end support for handoffs across heterogeneous telephony systems
on dual-mode mobile devices,” Elsevier Computer Communications, Special
Issue on End-to-End Support over Heterogeneous Wired-Wireless Networks, article
in press, April 2007.
[2] S. E. Poltrock and J. Grudin, “Videoconferencing: Recent experiments and
reassessment,” in Proceedings of the 38th Annual Hawaii International Conference
on System Sciences (HICSS ’05), pp. 3–6, January 2005.
[3] C.-W. Lin, Y.-C. Chen, and M.-T. Sun, “Dynamic region of interest transcoding
for multipoint video conferencing,” IEEE Transactions on Circuits and Systems
for Video Technology, pp. 982–992, October 2003.
[4] I. E. Akkus, M. R. Civanlar, and O. Ozkasap, “Peer-to-peer multipoint video conferencing
using layered video,” Signal Processing and Communications Applications, pp.
1–4, April 2006.
[5] VIBO Telecom Inc., “Taiwan 3G - VIBO Telecom.”
[6] R. G. Cutler and A. L. Bridgewater, “Audio/video synchronization using audio
hashing,” U.S. Patent Application 20060291478, December 2006. Online Available
at: http://www.freepatentsonline.com/20060291478.html
[7] Octave Communications, “Octave communications.” Online Available at:
http://www.octave.in/menu.htm
[8] S. E. Poltrock and J. Grudin, “Videoconferencing: Recent experiments and reassessment,”
IEEE, 2005.
[9] J. Huang, W.-C. Feng, J. Walpole, and W. Jouve, “An experimental analysis of
DCT-based approaches for fine-grain multi-resolution video,” Multimedia Systems,
pp. 513–531, January 2006.
[10] K.-T. Fung, Y.-L. Chan, and W.-C. Siu, “Low-complexity and high-quality
frame-skipping transcoder for continuous presence multipoint video conferencing,”
IEEE Transactions on Multimedia, no. 1, pp. 31–46, February 2004.
[11] C.-W. Lin, Y.-C. Chen, and M.-T. Sun, “Dynamic region of interest transcoding
for multipoint video conferencing,” IEEE Transactions on Circuits and Systems
for Video Technology, no. 10, pp. 982–992, October 2003.
[12] M. Chen, G.-M. Su, and M. Wu, “Robust distributed multi-point video conferencing
over error-prone channels,” 2006 IEEE International Conference on
Multimedia and Expo, pp. 1149–1152, July 2006.
[13] M. R. Civanlar, O. Ozkasap, and T. Celebi, “Peer-to-peer multipoint videoconferencing
on the internet,” Signal Processing: Image Communication, pp. 743–754,
May 2005.
[14] I. E. Akkus, M. R. Civanlar, and O. Ozkasap, “Peer-to-peer multipoint video
conferencing using layered video,” in Signal Processing and Communications
Applications, 2006 IEEE 14th, pp. 1–4, 2006.
[15] MpegTV, “Mpeg.org.” Online Available at: http://www.mpeg.org/
[16] C. Liu, Y. Xie, and M. J. Lee, “Multipoint multimedia teleconference system
with adaptive synchronization,” IEEE JSAC, pp. 1422–1435, September 1996.
[17] Y. Xie, C. Liu, M. J. Lee, and Y. N. Saadawi, “Adaptive multimedia synchronization
in a teleconference system,” ACM/Springer Multimedia Systems, no. 4,
pp. 326–337, 1999.
[18] H. Liu and M. E. Zarki, “A synchronization control scheme for real-time streaming
multimedia applications,” in Proceedings of 13th Packet Video Workshop,
April 2003.
[19] ——, “On the adaptive delay and synchronization control for video conferencing
over the internet,” in Proceedings of ICN 2004, March 2004.
[20] ——, “An adaptive delay and synchronization control scheme for wi-fi based
audio/video conferencing,” Springer Science + Business Media, pp. 511–522,
May 2006.
[21] C.-C. Kuo, M.-S. Chen, and J.-C. Chen, “An adaptive transmission scheme for
audio and video synchronization based on real-time transport protocol,” IEEE
International Conference on Multimedia and Expo, pp. 525–528, 2001.
[22] C. Kim, K.-D. Seo, W. Sung, and S.-H. Jung, “Efficient audio/video synchronization
method for video telephony system in consumer cellular phones,”
ICCE ’06 Consumer Electronics, pp. 137–138, January 2006.
[23] M. Yang, N. Bourbakis, Z. Chen, and M. Trifas, “An efficient audio-video synchronization
methodology,” IEEE International Conference on Multimedia and
Expo, pp. 767–770, July 2007.
[24] WIKIPEDIA, “Lip sync.” Online Available at: http://en.wikipedia.org/wiki/Lip_sync
[25] G. Zoric and I. S. Pandzic, “A real-time lip sync system using a genetic algorithm
for automatic neural network configuration,” IEEE International Conference on
Multimedia and Expo, pp. 1366–1369, July 2005.
[26] ——, “Automatic lip sync and its use in the new multimedia services for mobile
devices,” in Proceedings of the 8th International Conference on Telecommunications,
2005, pp. 353–358, June 2005.
[27] W.-N. Lie and H.-C. Hsieh, “Lips detection by morphological image processing,”
in Proceedings of ICSP ’98, pp. 1084–1087, 1998.
[28] D. F. McAllister, R. D. Rodman, D. L. Bitzer, and A. S. Freeman, “Lip synchronization
of speech,” 1998.
[29] GoldWave Inc., “Audio editing software- GoldWave.” Online Available at:
http://www.goldwave.com/
[30] D. L. Mills, “Network time protocol (version 3) specification, implementation,
and analysis,” RFC 1305, 1992.
[31] S.-M. Jun, D.-H. Yu, Y.-H. Kim, and S.-Y. Seong, “A time synchronization
method for NTP,” in RTCSA ’99: Proceedings of the Sixth International Conference
on Real-Time Computing Systems and Applications. Washington, DC,
USA: IEEE Computer Society, 1999, pp. 466–473.
[32] R. Steinmetz, “Human perception for jitter and media synchronization,” IEEE
Journal on Selected Areas in Comm., no. 1, pp. 61–72, January 1996.
[33] H. Manhaiem, R. Silon, G. Fartuk, and S. Refael, Packet Loss Concealment
Techniques and Algorithms.
[34] The MathWorks, Inc, “The MathWorks- MATLAB and Simulink for technical
computing.” Online Available at: http://www.mathworks.com/
[35] H. Boril and P. Pollak, “Direct time domain fundamental frequency estimation
of speech in noisy conditions,” European Signal Processing Conference 2004, pp.
1003–1006, September 2004.
[36] T. Shimamura and H. Kobayashi, “Weighted autocorrelation for pitch extraction
of noisy speech,” IEEE Transactions on Speech and Audio Processing, pp.
727–730, October 2001.
[37] B. V. K. Kumar, A. Mahalanobis, and R. D. Juday, Correlation Pattern Recognition.
Cambridge University Press, 2005.
VoiceAge Corporation, “Open AMR initiative - AMR codec.” Online Available at:
www.voiceage.com
[39] YUVSoft Corp. and Graphics and Media Lab, “Ultimate compression resources
catalog.” Online Available at: http://www.compression-links.info/
[40] DSPRelated.com, “MATLAB code- Mel Frequency Cepstral Coefficients.”
Online Available at: http://dsprelated.com/
[41] A. M. Reddy and B. Raj, “Soft mask methods for single-channel speaker separation,”
IEEE Trans. Audio, Speech, Lang. Process., no. 6, pp. 1766–1776, August
2007.
[42] P. Smaragdis, “Convolutive speech bases and their application to supervised
speech separation,” IEEE Trans. Audio, Speech, Lang. Process., no. 1, pp. 1–12,
January 2007.
[43] B. Raj and P. Smaragdis, “Latent variable decomposition of spectrograms for
single channel speaker separation,” IEEE Workshop on Applications of Signal
Process. to Audio and Acoustics, October 2005.
[44] D. Chazan, Y. Stettiner, and D. Malah, “Optimal multi-pitch estimation using
the EM algorithm for co-channel speech separation,” IEEE Trans. Audio, Speech,
Lang. Process., 1993.
[45] D. P. Morgan, E. B. George, L. T. Lee, and S. M. Kay, “Cochannel speaker
separation by harmonic enhancement and suppression,” IEEE Trans. Audio,
Speech, Lang. Process., no. 5, pp. 407–424, September 1997.
[46] D.-L. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles,
Algorithms, and Applications. Hoboken, New Jersey: John Wiley and
Sons, 2006.
[47] B. C. J. Moore, An Introduction to the Psychology of Hearing, 5th ed. San
Diego, CA: Academic Press, 2003.
[48] A. Jourjine, S. Rickard, and O. Yilmaz, “Blind separation of disjoint orthogonal
signals: Demixing N sources from 2 mixtures,” in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing, pp. 2985–2988,
June 2000.
[49] S. T. Roweis, “Factorial models and refiltering for speech separation and denoising,”
In Proceedings of Eurospeech, pp. 1009–1012, September 2003.
[50] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency
masking,” IEEE Transactions on Signal Processing, no. 7, pp. 1830–1847,
July 2004.
[51] Z. Shan, J. Swary, and S. Aviyente, “Underdetermined source separation in the
time-frequency domain,” ICASSP, pp. 945–948, 2007.
[52] A. Aissa-El-Bey, K. Abed-Meraim, and Y. Grenier, “Underdetermined blind separation
of audio sources from the time-frequency representation of their convolutive
mixtures,” ICASSP, pp. 153–156, September 2007.
[53] L. T. Nguyen, A. Belouchrani, K. Abed-Meraim, and B. Boashash, “Separating
more sources than sensors using time-frequency distributions,” ISSPA, pp. 583–
586, August 2001.
[54] S. Rickard, R. Balan, and J. Rosca, “Real-time time-frequency based
blind source separation,” pp. 651–656, May 2001. Online Available at:
http://citeseer.ist.psu.edu/rickard01realtime.html
[55] P. Bofill and M. Zibulevsky, “Blind separation of more sources than mixtures
using sparsity of their short-time Fourier transform,” in Proc. Int. Workshop
Independent Component Anal. Blind Source Separation, Helsinki, Finland, pp.
87–92, June 2000.
[56] S. Rickard and O. Yilmaz, “On the approximate W-Disjoint Orthogonality of
speech,” ICASSP, pp. 13–17, May 2002.
[57] M. Aoki, M. Okamoto, S. Aoki, H. Matsui, T. Sakurai, and Y. Kaneda, “Sound
source segregation based on estimating incident angle of each frequency component
of input signals acquired by multiple microphones,” Acoust. Sci. Technol.,
no. 2, pp. 149–157, February 2001.
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/42230
dc.description.abstract: As multi-functional telephony devices grow in popularity, a traditional audio conference may now involve heterogeneous teleconferencing devices, including POTS phones, dual-mode smartphones, pocket PCs, and so on. Some of these devices can access IP networks and support video conferencing with peer devices in the audio conference, giving a better conferencing experience. In this scenario, it becomes necessary to synchronize the audio streams, which traverse the PSTN, with the video streams, which traverse the IP network. Related work has investigated audio/video synchronization, but only within a homogeneous network, so it cannot be applied to the target scenario.
In this thesis we propose an end-to-end framework for audio/video synchronization and simplify the problem to one that requires only synchronization between PSTN and IP audio streams. We first employ a time-domain algorithm based on cross correlation and show that it is ineffective at synchronizing audio streams distorted by noise or packet loss. Hence, we extract distortion-tolerant audio features with digital speech processing (DSP) techniques. Applying MFCC in the synchronization algorithm yields respectable performance for audio streams distorted by the voice codec and by packet loss, but MFCC is inherently vulnerable to overlapping speakers. We therefore leverage the sparsity of speech in spectrograms to design a spectrogram-based synchronization algorithm that achieves favorable performance for speech mixtures and noisy speech. Evaluation results show that DSP techniques improve both the accuracy and the robustness of synchronization across PSTN audio streams and IP video streams. [en]
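To make the cross-correlation step concrete, here is a minimal Python sketch of time-domain lag estimation, assuming NumPy and SciPy are available; the function name `estimate_lag` and all parameter values are illustrative assumptions, not the thesis implementation:

```python
# Minimal sketch of time-domain synchronization by cross correlation.
# Assumptions: both streams share a sampling rate, and the true offset
# lies within +/- max_lag_s seconds.
import numpy as np
from scipy import signal

def estimate_lag(reference, delayed, rate_hz=8000, max_lag_s=1.0):
    """Estimate how many seconds `delayed` lags `reference`."""
    corr = signal.correlate(delayed, reference, mode="full", method="fft")
    lags = signal.correlation_lags(len(delayed), len(reference), mode="full")
    # Search only a plausible lag range, mimicking a finite matching window.
    mask = np.abs(lags) <= int(max_lag_s * rate_hz)
    return lags[mask][np.argmax(corr[mask])] / rate_hz

# Toy usage: a noise signal standing in for speech, delayed by 250 ms.
rng = np.random.default_rng(0)
x = rng.standard_normal(8000 * 3)          # 3 s of "audio" at 8 kHz
y = np.concatenate([np.zeros(2000), x])    # delayed copy (2000 samples)
print(estimate_lag(x, y))                  # ~0.25
```

A peak at a positive lag means the second stream trails the first by that many samples; as the abstract notes, this raw-sample matching degrades once the two streams are distorted differently by noise or packet loss.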
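The MFCC-based scheme can be sketched in the same spirit: compare frame-level cepstral features, which tolerate codec distortion better than raw samples, across candidate shifts. This sketch assumes `librosa` for MFCC extraction; the hop size, shift range, and cosine-similarity scoring are assumed stand-ins for the thesis's similarity metric:

```python
# Sketch of feature-domain synchronization: slide MFCC frame sequences
# against each other and pick the shift with the highest mean cosine
# similarity. Illustrative parameters; not the thesis implementation.
import numpy as np
import librosa

def mfcc_lag_seconds(reference, delayed, rate_hz=8000, hop=80, max_shift=100):
    ref = librosa.feature.mfcc(y=reference, sr=rate_hz, n_mfcc=13, hop_length=hop)
    dly = librosa.feature.mfcc(y=delayed, sr=rate_hz, n_mfcc=13, hop_length=hop)

    def score(shift):  # mean cosine similarity with `delayed` shifted back
        n = min(ref.shape[1], dly.shape[1] - shift)
        if n <= 0:
            return -np.inf
        a, b = ref[:, :n], dly[:, shift:shift + n]
        num = np.sum(a * b, axis=0)
        den = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-9
        return float(np.mean(num / den))

    best = max(range(max_shift), key=score)   # non-negative lags only, for brevity
    return best * hop / rate_hz               # frames -> seconds
```

The resolution here is one hop (10 ms at these settings), which is why misalignment of analysis windows appears as a separate evaluation axis in the table of contents below.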
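Finally, the spectrogram-based idea can be illustrated by keeping only each stream's most significant time-frequency cells and aligning the resulting binary masks; because concurrent speakers tend to occupy largely disjoint cells, an interfering speaker corrupts fewer matched cells than in the MFCC case. Again a hedged sketch, with the top-5% significance threshold and the overlap score standing in for the thesis's significance-determination step:

```python
# Sketch of spectrogram-based synchronization: threshold each stream's
# STFT magnitude to its strongest cells, then find the frame shift whose
# binary masks overlap most. Parameters are illustrative assumptions.
import numpy as np
from scipy import signal

def significant_mask(x, rate_hz=8000, nperseg=256, keep=0.05):
    _, _, Z = signal.stft(x, fs=rate_hz, nperseg=nperseg)
    mag = np.abs(Z)
    return mag >= np.quantile(mag, 1.0 - keep)   # keep the top 5% of cells

def mask_lag_frames(ref_mask, dly_mask, max_shift=50):
    def overlap(shift):
        n = min(ref_mask.shape[1], dly_mask.shape[1] - shift)
        if n <= 0:
            return -1.0
        hits = np.logical_and(ref_mask[:, :n], dly_mask[:, shift:shift + n])
        return hits.sum() / max(dly_mask[:, shift:shift + n].sum(), 1)
    return max(range(max_shift), key=overlap)    # shift in STFT frames
```

With nperseg=256 and SciPy's default 50% segment overlap, each mask frame advances 128 samples (16 ms at 8 kHz), so the recovered frame shift converts to seconds the same way as in the MFCC sketch.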
dc.description.provenance: Made available in DSpace on 2021-06-15T00:54:13Z (GMT). No. of bitstreams: 1. ntu-97-R95942032-1.pdf: 1618719 bytes, checksum: d0073e8ae636cb3aeaa8aa2c1ba3282b (MD5). Previous issue date: 2008 [en]
dc.description.tableofcontents:
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
2.1 Heterogeneous Teleconferencing Scenario
2.1.1 Audio Conference Architecture
2.1.2 Video Conference Architecture
2.2 Related Work
2.2.1 Conventional IP Audio-Video Synchronization
2.2.2 Lip Synchronization
CHAPTER 3 A FRAMEWORK FOR AUDIO-VIDEO SYNCHRONIZATION
3.1 Synchronization Framework
3.1.1 Concept of Simplification
3.1.2 Proposed Framework
3.2 Asynchronism Measurement
3.2.1 Experiment Setup
3.2.2 Measurement Results
3.3 Challenges of Audio Synchronization
3.3.1 Distortion by Voice Codec
3.3.2 Distortion by Noise
3.3.3 Interference by Other Conferees
3.3.4 Packet Loss in Wireless Connection
3.3.5 Reactiveness to Network Dynamics
CHAPTER 4 SYNCHRONIZATION BASED ON CROSS CORRELATION
4.1 Time-domain Audio Features
4.2 Cross-Correlation-Based Synchronization
4.2.1 Basics of Cross Correlation
4.2.2 Cross-Correlation Synchronization Module
4.3 Design Issues
4.3.1 Matching Window Size
4.3.2 Search Step Size
4.3.3 Short Conclusion
4.4 Performance Evaluation
4.4.1 Codec Distortion
4.4.2 Noise Distortion
4.4.3 Overlapping Speakers
4.4.4 Packet Loss
4.4.5 Short Conclusion on Performance
CHAPTER 5 SYNCHRONIZATION BASED ON MFCC
5.1 MFCC-Based Synchronization
5.1.1 Basics of MFCC
5.2 Synchronization Algorithm Design
5.2.1 Mathematical Analysis
5.2.2 Similarity Metric
5.3 Performance Evaluation
5.3.1 Codec Distortion
5.3.2 Misalignment of Analysis Windows
5.3.3 Noise Distortion
5.3.4 Overlapping Speakers
5.3.5 Packet Loss
5.3.6 Short Conclusion on Performance
CHAPTER 6 SYNCHRONIZATION BASED ON SPECTROGRAM
6.1 Spectrogram-Based Synchronization
6.1.1 Sparsity on Spectrogram
6.2 Synchronization Algorithm Design
6.2.1 Significance Determination
6.2.2 Synchronization Module
6.3 Performance Evaluation
6.3.1 Codec Distortion
6.3.2 Misalignment of Analysis Windows
6.3.3 Noise Distortion
6.3.4 Overlapping Speakers
6.3.5 Packet Loss
6.3.6 Short Conclusion on Performance
CHAPTER 7 CONCLUSIONS AND FUTURE WORK
7.1 Performance Comparison
7.1.1 Codec Distortion
7.1.2 Noise Distortion
7.1.3 Overlapping Speakers
7.1.4 Packet Loss
7.2 Conclusions
REFERENCES
dc.language.iso: en
dc.title: 以數位語音處理技術解決異質視訊會議之同步問題 [zh_TW]
dc.title: On Using Digital Speech Processing Techniques for Synchronization among Heterogeneous Teleconferencing Devices [en]
dc.type: Thesis
dc.date.schoolyear: 96-2
dc.description.degree: 碩士 (Master's)
dc.contributor.oralexamcommittee: 葉丙成 (Ping-Cheng Yeh), 周俊廷 (Chun-Ting Chou), 鄭振牟 (Chen-Mou Cheng), 高榮鴻 (Rung-Hung Gau)
dc.subject.keyword: 數位語音處理技術, 語音影像同步, 異質網路, 視訊會議 [zh_TW]
dc.subject.keyword: Digital Speech Processing, Audio/Video Synchronization, Heterogeneous Network, Teleconferencing [en]
dc.relation.page: 92
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2008-08-07
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science)
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering)
Appears in collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in this item:
File: ntu-97-1.pdf (currently not authorized for public access)
Size: 1.58 MB
Format: Adobe PDF