Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49633

Full metadata record
DC field: value [language]
dc.contributor.advisor: 黃乾綱 (Chien-Kang Huang)
dc.contributor.author: Yu-Chi Lin [en]
dc.contributor.author: 林玉琪 [zh_TW]
dc.date.accessioned: 2021-06-15T11:38:51Z
dc.date.available: 2016-09-13
dc.date.copyright: 2016-09-13
dc.date.issued: 2016
dc.date.submitted: 2016-08-15
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/49633
dc.description.abstract: Signal-processing approaches in continuous speech recognition systems fall into two categories: whole-signal recognition and syllable-based recognition. The syllable-based approach reduces noise interference by analyzing only the signal regions where energy is concentrated. Its first step is to detect accurate syllable boundary positions, but coarticulation, which is common in spoken speech signals, prevents correct boundaries from being obtained. This study therefore detects syllable boundaries from the lip images of continuous Mandarin speech, using transitions between different lip shapes as the key cue.
The proposed method first locates the face with SIFT, then applies dense optical flow to compute the lip-image motion between adjacent frames; this motion magnitude is the basis for boundary detection. Because the audio and video in our recordings were captured simultaneously, image-detected syllable boundaries are injected into the audio-level detection results to raise accuracy. Experiments show that, among syllables that cannot be separated at the audio level, more than half of the boundaries can be recovered from lip-motion analysis. Combining the two cues strengthens the system's stability in noisy conditions and helps it adapt to real-world environments with varied noise. This study also recorded a video database of continuous Mandarin digit sequences as experimental data: 40 subjects and 2,480 clips in total (covering both a reading stage and a natural-speed stage). The database will be released to the academic community to promote Mandarin lip-reading research. [zh_TW]
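The abstract notes that audio-level segmentation keys on where signal energy is concentrated. A minimal sketch of that idea in Python, using short-time energy over a synthetic signal (the frame length, hop size, and the synthetic noise bursts are illustrative assumptions, not values from the thesis):

```python
import numpy as np

def short_time_energy(signal, frame_len=256, hop=128):
    """Frame-wise energy of an audio signal. Syllable nuclei show up as
    energy peaks; candidate boundaries appear as low-energy gaps."""
    signal = np.asarray(signal, dtype=float)
    n = 1 + max(0, len(signal) - frame_len) // hop
    return np.array([np.sum(signal[i * hop : i * hop + frame_len] ** 2)
                     for i in range(n)])

# Two loud bursts separated by near-silence: the energy dip between them
# is the audio-side cue for a syllable boundary.
rng = np.random.default_rng(0)
sig = np.concatenate([rng.normal(0, 1.0, 1024),
                      rng.normal(0, 0.01, 512),
                      rng.normal(0, 1.0, 1024)])
e = short_time_energy(sig)
print(int(np.argmin(e)))  # frame index of the quietest gap
```

The thesis's point is that coarticulation often erases this gap in fluent speech, which is exactly where the visual channel is brought in.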
dc.description.abstract: Automatic speech recognition (ASR) research on video can be split into two categories by signal-processing approach: recognizing the streaming signal as a whole, or recognizing it syllable by syllable. The latter analyzes the regions where energy is concentrated, reducing noise interference. Syllable-based recognition requires detecting syllable boundaries correctly in continuous speech; to this end, this study detects syllable boundaries from lip images in a continuous Mandarin corpus, using the transitions between different lip shapes as the key information.
The proposed algorithm first locates the face, then applies dense optical flow to compute the lip-image motion between every pair of neighboring video frames; this motion magnitude is the basis for detecting syllable boundaries in continuous video. Since the audio and video were recorded simultaneously, it is reasonable to assume the boundary between two adjacent syllables is visible in the image information. Experiments show that more than half of the syllable boundaries can be extracted from lip-image motion when the audio signals of the syllables cannot be separated by their energy distribution. Using both the audio and video channels not only raises the stability of syllable boundary detection, but also makes the system robust in noisy environments. Furthermore, the database recorded for this study, consisting of 2,480 clips (both reading and spontaneous speaking) from 40 informants, will be opened for download to promote academic research in Mandarin continuous speech recognition. [en]
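The visual side of the pipeline reduces each frame pair to one number (lip-region motion magnitude from dense optical flow) and then looks for lulls in that curve. The following is a minimal sketch of the boundary-picking step on a synthetic motion curve; the valley-below-mean rule and the `min_gap` spacing are illustrative assumptions, since the thesis's exact decision criteria are not reproduced here:

```python
import numpy as np

def syllable_boundaries(motion, min_gap=3):
    """Pick candidate syllable boundaries as local minima ("valleys")
    of a per-frame lip-motion magnitude curve.

    `motion[i]` stands for the mean optical-flow magnitude over the lip
    region between frames i and i+1. A valley counts as a boundary if it
    dips below the curve's mean and is at least `min_gap` frames from
    the previous boundary (both rules are assumptions for this sketch).
    """
    motion = np.asarray(motion, dtype=float)
    thresh = motion.mean()
    boundaries = []
    for i in range(1, len(motion) - 1):
        is_valley = motion[i] <= motion[i - 1] and motion[i] <= motion[i + 1]
        if is_valley and motion[i] < thresh:
            if not boundaries or i - boundaries[-1] >= min_gap:
                boundaries.append(i)
    return boundaries

# Hand-made curve with two motion "humps" (two syllables) and one lull.
curve = [0.1, 0.8, 1.2, 0.9, 0.2, 0.1, 0.3, 1.0, 1.1, 0.7, 0.2]
print(syllable_boundaries(curve))  # → [5]
```

On real data the `motion` array would come from dense optical flow (e.g. Farnebäck's method, which the thesis adopts) averaged over the detected lip region; here the single valley between the two humps is reported as the boundary.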
dc.description.provenance: Made available in DSpace on 2021-06-15T11:38:51Z (GMT). No. of bitstreams: 1; ntu-105-R03525064-1.pdf: 4568203 bytes, checksum: 7cd4e4da46a59bfa1296ce77402bdeb3 (MD5). Previous issue date: 2016 [en]
dc.description.tableofcontents:
Acknowledgments i
Chinese Abstract ii
Abstract iii
Table of Contents iv
List of Figures vi
List of Tables viii
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 2
1.3 Contributions of This Study 2
1.4 Thesis Organization 3
Chapter 2 Literature Review 4
2.1 Lip-Based Syllable Segmentation Techniques 4
2.2 Local Image Feature Detection 7
2.2.1 Review of Local Feature Detectors 7
2.2.2 Scale-Invariant Feature Transform (SIFT) 11
2.3 Image Motion Detection 16
2.3.1 Review of Motion Detection Techniques 16
2.3.2 Dense Optical Flow 18
2.4 Survey of Audio-Visual Databases 22
2.5 Fundamentals of Image and Signal Processing 25
2.5.1 Mathematical Morphology 25
2.5.2 Connected Components 27
Chapter 3 Visual Syllable Extraction Algorithm 28
3.1 Problem Definition and Algorithm Overview 28
3.2 Collection of the Continuous Mandarin Digit Database 30
3.3 Real-Time Face Detection 31
3.3.1 Real-Time Face Detection Pipeline 32
3.3.2 Searching for the Best Feature Pairs 33
3.4 Visual Syllable Segmentation 36
3.4.1 Lip Motion Extraction Pipeline 36
3.4.2 Motion Estimation between Adjacent Frames with Dense Optical Flow 37
3.4.3 Computing Lip Motion Magnitude 43
3.4.4 Visual Syllable Analysis 45
Chapter 4 Experimental Results and Discussion 49
4.1 Test Videos and Manually Labeled Syllable Boundaries (Ground Truth) 49
4.2 Visual Syllable Detection 51
4.2.1 Accuracy of Visual Syllable Detection 51
4.2.2 Accuracy Comparison with Recent Literature 54
4.3 Combined Analysis of Visual and Audio Syllables 57
4.4 [Supplementary Experiment] SIFT+RMSD Face Detection vs. AdaBoost Classifier 61
4.5 [Supplementary Experiment] Eye Detection Rate Statistics 63
Chapter 5 Conclusions 65
References 67
Appendix 1 Subject Images 70
Appendix 2 Experimental Digit Sequences 75
Appendix 3 Subject Audio-Visual Data Consent Forms 76
dc.language.iso: zh-TW
dc.subject: 自動音節邊界偵測 [zh_TW]
dc.subject: 密集光流法 [zh_TW]
dc.subject: 人臉定位 [zh_TW]
dc.subject: 連續語音辨識 [zh_TW]
dc.subject: 連續唇語辨識 [zh_TW]
dc.subject: 自動音節分割 [zh_TW]
dc.subject: Face localization [en]
dc.subject: Continuous lip reading [en]
dc.subject: Continuous speech recognition [en]
dc.subject: Automatic syllable boundary detection [en]
dc.subject: Automatic syllable segmentation [en]
dc.subject: Dense optical flow [en]
dc.title: 基於脣形光流偵測之連續中文音節邊界自動擷取方法 [zh_TW]
dc.title: Lip Optical-flow Driven Automatic Continuous Speech Syllabification in Mandarin [en]
dc.type: Thesis
dc.date.schoolyear: 104-2
dc.description.degree: 碩士 (Master)
dc.contributor.oralexamcommittee: 傅楸善 (Chiou-Shann Fuh), 張恆華 (Heng-Hua Chang)
dc.subject.keyword: 連續唇語辨識, 連續語音辨識, 自動音節邊界偵測, 自動音節分割, 密集光流法, 人臉定位 [zh_TW]
dc.subject.keyword: Continuous lip reading, Continuous speech recognition, Automatic syllable boundary detection, Automatic syllable segmentation, Dense optical flow, Face localization [en]
dc.relation.page: 76
dc.identifier.doi: 10.6342/NTU201602685
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2016-08-16
dc.contributor.author-college: 工學院 (College of Engineering) [zh_TW]
dc.contributor.author-dept: 工程科學及海洋工程學研究所 (Graduate Institute of Engineering Science and Ocean Engineering) [zh_TW]
Appears in Collections: Department of Engineering Science and Ocean Engineering

Files in This Item:
ntu-105-1.pdf, 4.46 MB, Adobe PDF, restricted (not publicly accessible)