Please use this Handle URI to cite this item: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/64129
Full metadata record (DC field: value [language])
dc.contributor.advisor: 李琳山 (Lin-shan Lee)
dc.contributor.author: Chun-an Chan [en]
dc.contributor.author: 詹竣安 [zh_TW]
dc.date.accessioned: 2021-06-16T17:31:14Z
dc.date.available: 2012-12-31
dc.date.copyright: 2012-08-19
dc.date.issued: 2012
dc.date.submitted: 2012-08-15
dc.identifier.uri: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/64129
dc.description.abstract [zh_TW]:
With the maturing of multimedia and network technologies, the amount of spoken content carrying knowledge and information grows rapidly every day. From meetings, courses, and lectures to the many kinds of video on the Internet, much information exists in audio and video form, and once the relevant speech can be located, the useful multimedia content can be found as well. The key technology in spoken content retrieval is spoken term detection, whose goal is to find, within the spoken documents, exact occurrences of the query term entered by the user. Query terms are usually given as text, so text-based search over speech inevitably relies on automatic speech recognition. The recent rise of smartphones has opened new possibilities for spoken content search: because entering text on a phone is less convenient than on a computer, the spoken query has become a new and common input form, prompting much research on unsupervised spoken term detection, since the spoken query can then be matched against the spoken documents directly at the signal level, without necessarily performing speech recognition. Such approaches are no longer limited by the many problems of speech recognition and need no manually annotated training data; searching for matching terms remains feasible even when no recognizer exists for the language, or when the language has no written form.
In this dissertation we propose two families of approaches for unsupervised spoken term detection: dynamic time warping (DTW)-based and model-based approaches. We point out two major problems of the conventional segmental DTW approach: it cannot handle large speaking rate distortion, and its computational cost is too high. We propose slope-constrained DTW to deal with the speaking rate problem, and we represent the speech signal with acoustic segments instead of frames, leading to segment-based DTW, which greatly reduces the required computation. With a further two-stage spoken term detection scheme, we obtain better detection performance than the conventional approach in much less time, and a pseudo relevance feedback mechanism improves the detection accuracy further.
We also propose a procedure for constructing acoustic segment models, which describe the distribution of recurring speech patterns in the acoustic space. With these models, the documents are converted into model sequences, and the spoken query is used to search for similar model subsequences; this not only characterizes the documents with a more global linguistic structure but also greatly reduces the search time. In the pseudo relevance feedback framework we likewise design a pseudo likelihood ratio test to verify whether the collected candidate terms are correct. The performance of these approaches opens a new direction for unsupervised spoken content search based on hidden Markov models, to which many techniques matured in speech recognition may be applied in the future.
Finally, we examine the performance of integrating the DTW-based and model-based approaches within the pseudo relevance feedback framework; experiments show a 14.2% improvement in mean average precision using only 23% of the computation time.
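As a rough illustration of the slope-constrained alignment described in the abstract, the following Python sketch (with our own hypothetical names and a plain Euclidean frame distance; the dissertation itself matches Gaussian posteriorgrams and adds partial matching, so this is a simplified sketch rather than its implementation) bounds the warping-path slope by restricting the allowed local steps:

import numpy as np

def slope_constrained_dtw(query, doc):
    """Align query (m x d array) against doc (n x d array) with local steps
    (1,1), (1,2) and (2,1), which keeps the warping-path slope between 1/2
    and 2 and so limits how much speaking-rate mismatch the path can absorb."""
    m, n = len(query), len(doc)
    # Frame-level local distances (Euclidean here; a posteriorgram system
    # would typically use a negative-log inner product instead).
    dist = np.linalg.norm(query[:, None, :] - doc[None, :, :], axis=-1)
    dp = np.full((m, n), np.inf)
    dp[0, 0] = dist[0, 0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i >= 1 and j >= 1:
                best = min(best, dp[i - 1, j - 1])  # slope 1
            if i >= 1 and j >= 2:
                best = min(best, dp[i - 1, j - 2])  # slope 1/2
            if i >= 2 and j >= 1:
                # slope 2; a full implementation would also charge the
                # skipped query frame, omitted here for brevity
                best = min(best, dp[i - 2, j - 1])
            dp[i, j] = dist[i, j] + best            # stays inf if no legal path
    # Normalize by the query length so scores are comparable across queries.
    return dp[m - 1, n - 1] / m

# Hypothetical usage with random features standing in for posteriorgrams:
q = np.random.rand(20, 39)   # 20 query frames, 39-dimensional features
d = np.random.rand(35, 39)   # 35 frames of one candidate document region
print(slope_constrained_dtw(q, d))

Restricting the steps to (1,1), (1,2) and (2,1) keeps the path slope within [1/2, 2]; this is the essence of how a slope constraint limits the speaking-rate distortion a match is allowed to absorb.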
dc.description.abstract [en]:
Unsupervised spoken term detection (STD) with spoken queries is a new and important topic in multimedia retrieval. Unsupervised approaches, which require no annotated data, bypass various problems in speech recognition, particularly recognition errors under different acoustic and linguistic conditions. Such approaches even make searching for spoken terms possible for low-resourced languages or languages without a writing system. In this dissertation, we propose several techniques to address the problem of unsupervised STD with spoken queries.
We propose two improved DTW-based approaches to handle the speaking rate distortion and computational efficiency issues of the conventional segmental DTW approach. Slope-Constrained Dynamic Time Warping (SC-DTW) is developed to handle speaking rate distortion, and segment-based DTW is devised to reduce the computational burden. Cascading these two approaches, together with the Weighted Pseudo Similarity (WPS) of SC-DTW in the Pseudo Relevance Feedback (PRF) framework, yields significant improvements in both detection performance and efficiency.
We also propose two model-based approaches for unsupervised STD. We design procedures to construct a set of Acoustic Segment Models (ASMs) that describe the patterns and structures of the target language, so that signal trajectory modeling techniques can be leveraged through the ASMs. Using the ASMs, we propose the Document State Matching (DSM) approach, which matches spoken queries to the ASM states in the documents with a Duration-Constrained Viterbi (DC-Vite) algorithm developed for this purpose. A Pseudo Likelihood Ratio (PLR) approach is further proposed to verify the hypotheses in the PRF framework. Experimental results show that the model-based approaches achieve comparable detection performance with much less computation time. Our migration from DTW-based to model-based approaches opens the possibility of leveraging well-developed model-based speech processing techniques in unsupervised STD.
Finally, we tested various configurations for integrating these approaches in our system. With the combined model-based and DTW-based approaches, an absolute Mean Average Precision (MAP) improvement of 14.2% was achieved using only 23% of the CPU time on the Mandarin broadcast news corpus.
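To make the model-sequence search concrete, the sketch below assumes (as its own simplification) that the query and the documents have already been decoded into discrete acoustic segment model (ASM) label sequences, and it simply slides the query over each document with a length-normalized edit distance; it illustrates the idea of searching model sequences rather than frame-level features, not the dissertation's Document State Matching or Duration-Constrained Viterbi procedure.

def edit_distance(a, b):
    """Levenshtein distance between two label sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n]

def search(query_labels, doc_labels, slack=2):
    """Slide windows of roughly the query's length over the document and
    return the lowest normalized edit distance together with its span."""
    q = len(query_labels)
    best = (float("inf"), None)
    for start in range(len(doc_labels)):
        for width in range(max(1, q - slack), q + slack + 1):
            window = doc_labels[start:start + width]
            if len(window) < max(1, q - slack):
                break                      # ran past the end of the document
            score = edit_distance(query_labels, window) / q
            if score < best[0]:
                best = (score, (start, start + len(window)))
    return best

# Hypothetical toy example: each integer stands for an ASM label index.
query = [3, 3, 7, 1, 1, 9]
document = [5, 2, 3, 7, 7, 1, 9, 4, 8]
print(search(query, document))   # -> (score, (start, end)) of the best match

Because the comparison runs over short label sequences rather than frame-by-frame feature vectors, this style of search is what lets the model-based approaches reach comparable detection performance with far less computation.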
dc.description.provenance: Made available in DSpace on 2021-06-16T17:31:14Z (GMT). No. of bitstreams: 1.
ntu-101-F95942047-1.pdf: 2765108 bytes, checksum: 10a0f8618d0ea16414525f46d1bf3a7f (MD5). Previous issue date: 2012 [en]
dc.description.tableofcontents: Contents
Certification by the Oral Defense Committee
Acknowledgements
Abstract (Chinese)
Abstract
List of Tables
List of Figures
1 Introduction
  1.1 Unsupervised Spoken Term Detection with Spoken Queries
  1.2 Major Contributions of This Dissertation
  1.3 Thesis Organization
2 Background Review and Experimental Setup
  2.1 Segmental Dynamic Time Warping with Gaussian Posteriorgrams
  2.2 Experimental Setup
    2.2.1 Evaluation Metric
    2.2.2 Mandarin Broadcast News (News98)
    2.2.3 TIMIT
3 Improved DTW-based Approaches
  3.1 Motivation
  3.2 Slope-Constrained DTW (SC-DTW)
    3.2.1 Speaking Rate Distortion Issue
    3.2.2 Slope Constraint
    3.2.3 Partial Matching for SC-DTW
  3.3 Segment-based DTW
    3.3.1 Efficiency Issue
    3.3.2 Hierarchical Agglomerative Clustering for Acoustic Segments
    3.3.3 DTW with Acoustic Segments (ASs) — Segment-based DTW
    3.3.4 DTW Objective
    3.3.5 Supersegment Distance
  3.4 Cascading Segment-based DTW and SC-DTW
  3.5 Weighted Pseudo Similarity (WPS) of SC-DTW
  3.6 Experimental Results
    3.6.1 Experimental Result (I): SC-DTW vs. Segmental DTW
    3.6.2 Experimental Result (II): Cascading Segment-based DTW and SC-DTW
    3.6.3 Experimental Result (III): WPS
  3.7 Summary
4 Model-based Approaches
  4.1 Motivation
  4.2 Acoustic Segment Model (ASM)
    4.2.1 Acoustic Segments (ASs)
    4.2.2 AS Clustering and Labeling
    4.2.3 Model Training
  4.3 Document State Matching (DSM)
    4.3.1 Duration-Constrained Viterbi (DC-Vite)
  4.4 Pseudo Likelihood Ratio (PLR) Approach
  4.5 Experimental Results
    4.5.1 Experimental Results I: Conventional Viterbi vs. DC-Vite
    4.5.2 Experimental Results II: PLR
  4.6 Summary
5 Integrating DTW-based and Model-based Approaches
  5.1 Concatenating Approximated DC-Vite and SC-DTW
  5.2 WPS vs. PLR
  5.3 Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
References
Appendix A: HAC with Minimum SSE Criterion
dc.language.iso: en
dc.subject: 資訊檢索 [zh_TW]
dc.subject: 口語詞彙偵測 [zh_TW]
dc.subject: spoken term detection [en]
dc.subject: information retrieval [en]
dc.title: 以口語查詢之非督導式口語詞彙偵測 [zh_TW]
dc.title: Unsupervised Spoken Term Detection with Spoken Queries [en]
dc.type: Thesis
dc.date.schoolyear: 100-2
dc.description.degree: 博士 (Doctoral)
dc.contributor.oralexamcommittee: 雷少民, 陳信宏, 王小川, 簡立峰, 陳信希
dc.subject.keyword: 口語詞彙偵測, 資訊檢索 [zh_TW]
dc.subject.keyword: spoken term detection, information retrieval [en]
dc.relation.page: 75
dc.rights.note: 有償授權 (paid authorization)
dc.date.accepted: 2012-08-15
dc.contributor.author-college: 電機資訊學院 (College of Electrical Engineering and Computer Science) [zh_TW]
dc.contributor.author-dept: 電信工程學研究所 (Graduate Institute of Communication Engineering) [zh_TW]
Appears in collections: 電信工程學研究所 (Graduate Institute of Communication Engineering)

Files in this item:
ntu-101-1.pdf (2.7 MB, Adobe PDF); access restricted, not authorized for public access