Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4020
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | 鄭士康,王新民 | |
dc.contributor.author | Yu-Ren Chien | en |
dc.contributor.author | 簡御仁 | zh_TW |
dc.date.accessioned | 2021-05-13T08:40:35Z | - |
dc.date.available | 2016-03-08 | |
dc.date.available | 2021-05-13T08:40:35Z | - |
dc.date.copyright | 2016-03-08 | |
dc.date.issued | 2016 | |
dc.date.submitted | 2016-01-29 | |
dc.identifier.citation | [1] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In Proc. the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.
[2] J. C. Brown and M. S. Puckette. An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. Am., 92(5):2698–2701, 1992.
[3] Y.-R. Chien, H.-M. Wang, and S.-K. Jeng. Simulated formant modeling of accompanied singing signals for vocal melody extraction. In Proc. the 9th Sound and Music Computing Conference (SMC), 2012.
[4] K. Dressler. An auditory streaming approach for melody extraction from polyphonic music. In Proc. the 12th International Society for Music Information Retrieval Conference (ISMIR), 2011.
[5] J. L. Durrieu, G. Richard, B. David, and C. Févotte. Source/filter model for unsupervised main melody extraction from polyphonic audio signals. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):564–575, March 2010.
[6] D. P. W. Ellis and G. E. Poliner. Classification-based melody transcription. Mach. Learn., 65(2-3):439–456, 2006.
[7] G. Fant. Acoustic theory of speech production with calculations based on X-ray studies of Russian articulations. The Hague: Mouton, 1970.
[8] G. Fant. The LF-model revisited. Transformations and frequency domain analysis. STL-QPSR, 1995.
[9] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno. LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics. IEEE Journal of Selected Topics in Signal Processing, 5(6):1252–1261, October 2011.
[10] M. Goto. A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Communication, 43:311–329, 2004.
[11] J. W. Hawks and J. D. Miller. A formant bandwidth estimation procedure for vowel synthesis. J. Acoust. Soc. Am., 97(2):1343–1344, 1995.
[12] C.-L. Hsu and J.-S. R. Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Trans. Audio, Speech, Lang. Process., 18(2):310–319, Feb 2009.
[13] C.-L. Hsu, D. Wang, and J.-S. Jang. A trend estimation algorithm for singing pitch detection in musical recordings. In 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 393–396, May 2011.
[14] D. Iskandar, Y. Wang, M.-Y. Kan, and H. Li. Syllabic level automatic synchronization of music signals and text lyrics. In Proc. the 14th Annual ACM International Conference on Multimedia, 2006.
[15] ISO 226. Acoustics—normal equal-loudness contours, 2003.
[16] E. Joliveau, J. Smith, and J. Wolfe. Vocal tract resonances in singing: The soprano voice. Journal of the Acoustical Society of America, 116(4):2434–2439, October 2004.
[17] S. Joo, S. Park, S. Jo, and C. D. Yoo. Melody extraction based on harmonic coded structure. In Proc. the 12th International Society for Music Information Retrieval Conference (ISMIR), 2011.
[18] M.-Y. Kan, Y. Wang, D. Iskandar, T. L. Nwe, and A. Shenoy. LyricAlly: Automatic synchronization of textual lyrics to acoustic music signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):338–349, February 2008.
[19] R. D. Kent and C. Read. The acoustic analysis of speech. Singular/Thomson Learning, 2002.
[20] K. Lee and M. Cremer. Segmentation-based lyrics-audio alignment using dynamic programming. In Proc. the 9th International Conference on Music Information Retrieval (ISMIR), 2008.
[21] H.-L. Lu. Toward a High-Quality Singing Synthesizer with Vocal Texture Control. PhD thesis, Stanford University, 2002.
[22] M. Mauch, H. Fujihara, and M. Goto. Integrating additional chord information into HMM-based lyrics-to-audio alignment. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):200–210, January 2012.
[23] R. McAulay and T. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(4):744–754, Aug 1986.
[24] A. Mesaros and T. Virtanen. Automatic recognition of lyrics in singing. EURASIP Journal on Audio, Speech, and Music Processing, 2010(546047):1–11, 2010.
[25] G. E. Poliner, D. P. W. Ellis, A. F. Ehmann, E. Gómez, S. Streich, and B. Ong. Melody transcription from music audio: Approaches and evaluation. IEEE Trans. Audio, Speech, Lang. Process., 15(4):1247–1256, 2007.
[26] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, Feb 1989.
[27] J. Salamon and E. Gómez. Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1759–1770, Aug 2012.
[28] G. L. Salomão and J. Sundberg. What do male singers mean by modal and falsetto register? An investigation of the glottal voice source. Logopedics Phoniatrics Vocology, 34:73–83, 2009.
[29] C. Sutton. Transcription of vocal melodies in popular music. Master's thesis, Queen Mary, University of London, 2006.
[30] H. Tachibana, T. Ono, N. Ono, and S. Sagayama. Melody line estimation in homophonic music audio signals based on temporal-variability of melodic source. In 2010 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 425–428, March 2010.
[31] C. H. Wong, W. M. Szeto, and K. H. Wong. Automatic lyrics alignment for Cantonese popular music. Multimedia Systems, 12:307–323, 2007.
[32] W. R. Zemlin. Speech and hearing science: anatomy and physiology. Allyn and Bacon, Boston, 4th edition, 1998. | |
dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4020 | - |
dc.description.abstract | 本論文所探討的主題是有伴奏歌唱錄音的旋律分析以及歌詞分析。為了有效進行此分析,本論文提出一種獨特的概似模型作為方法的核心,它巧妙地結合了聲學語音學的知識以及實際蒐集而得的資料。此模型的基本要素是一套音色吻合度以及發聲狀態吻合度的量化評估方式,可為任一候選基本頻率(基頻)或者候選母音/發聲狀態進行評分。音色吻合度意指某個基頻值的諧波振幅序列所呈現之音色與參考音色之間的相似程度,而參考音色的定義則來自一小組歌聲音色範例。為特定基頻估算音色吻合度時,需要對所有音色範例進行基頻的修改,本論文提出的修改方式利用聲學語音學的模型,將修改前的聲帶波形以及共振峰頻率予以保留。此一概似模型在發聲狀態的部份,對弦波進行偵測、追蹤以及刪減的處理,以便在估計歌聲音量的同時,將伴奏的干擾減至最低。最後基頻或音節的估計值,是由概似模型與事前的順序模型共同決定。在使用多個資料集進行系統測試之前,此方法所涉及的所有數值參數均已完成最佳化,且使用的是數個不與測試資料有任何重複的發展資料集。對照實驗顯示,音色吻合度的使用與否,會在整體旋律正確率上面造成 13% 的差距,同時也會在平均標準化歌詞對齊誤差上面,造成 7% 的差距。 | zh_TW |
dc.description.abstract | This dissertation addresses melodic and lyrics analysis of accompanied singing recordings. Central to my approach are likelihood models that integrate acoustic-phonetic knowledge and real-world data. These models are based on a timbral fitness score and a voicing fitness score evaluated for each fundamental frequency (F0) or vowel/voicing candidate. Timbral fitness is measured for the partial amplitudes of an F0 value, with respect to a small set of vocal timbre examples. This F0-specific measurement of timbral fitness depends on an acoustic-phonetic F0 modification of each timbre example, which preserves glottal pulse shape and formant frequencies. In the voicing part of the likelihood models, sinusoids are detected, tracked, and pruned to give loudness values that minimize interference from the accompaniment. A final F0 or syllable estimate is determined by a prior sequential model in addition to the likelihood model. The numerical parameters involved in my approach were optimized on several development sets from different sources before the system was evaluated on multiple test sets separate from these development sets. Controlled experiments show that use of the timbral fitness score accounts for a 13% difference in overall melodic accuracy, and a 7% difference in average normalized lyrics alignment error. | en |
dc.description.provenance | Made available in DSpace on 2021-05-13T08:40:35Z (GMT). No. of bitstreams: 1 ntu-105-D98942017-1.pdf: 1487207 bytes, checksum: f24822476b42468c178b9e95563d29fd (MD5) Previous issue date: 2016 | en |
dc.description.tableofcontents | 口試委員會審定書 (Committee Approval Certificate): i
Acknowledgments: iii
摘要 (Chinese Abstract): v
Abstract: vii
1 Introduction: 1
1.1 Motivation: 1
1.2 Objective: 2
1.3 Previous Approaches: 3
1.3.1 Knowledge-Based Approaches: 3
1.3.2 Data-Driven Approaches: 4
1.3.3 Source Separation Techniques: 4
1.3.4 Comparison: 5
1.4 Contribution: 6
1.5 Structure of the Document: 7
2 Acoustic-Phonetic F0 Modification of Vocal Sinusoids: 9
2.1 Model of Human Voice Production: 9
2.1.1 Glottal Excitation: 10
2.1.2 Vocal Tract Filter: 12
2.2 Source-Filter Analysis: 14
2.2.1 Formulation: 14
2.2.2 Optimization: 15
2.3 Source-Filter Synthesis: 16
3 Vocal Melody Extraction Based on an F0 Likelihood Model: 19
3.1 System Overview: 19
3.2 Acoustic-Phonetic Model of F0 Likelihood: 19
3.2.1 Timbral Fitness Measure: 21
3.2.2 Loudness Measure: 24
3.3 Vocal Melody Extraction: 25
3.3.1 Vocal F0 Estimation: 26
3.3.2 Vocal Rest Detection: 28
3.4 Experiments: 29
3.4.1 Data Sets: 29
3.4.2 Performance Measures: 31
3.4.3 Results on the Development Sets: 32
3.4.4 Results on the Test Sets: 33
3.4.5 Examples: 34
3.4.6 Results of Controlled Experiments: 35
4 Lyrics Alignment Based on a Vowel Likelihood Model: 49
4.1 System Overview: 49
4.2 Acoustic-Phonetic Model of Vowel Likelihood: 50
4.2.1 Timbral Fitness Measure: 51
4.2.2 Voicing Fitness Measure: 54
4.3 Syllabic Position Estimation: 55
4.3.1 Lyrics Preprocessing: 55
4.3.2 Likelihood Evaluation: 56
4.3.3 Syllable Selection: 57
4.4 Experiments: 58
4.4.1 Data Sets: 58
4.4.2 Performance Measures: 59
4.4.3 Results on the Development Sets: 60
4.4.4 Results on the Test Sets: 61
4.4.5 Example: 62
4.4.6 Results of Controlled Experiments: 63
5 Conclusions: 69
5.1 Contribution: 69
5.2 Further Work: 70
Bibliography: 71 | |
dc.language.iso | en | |
dc.title | 源自聲學語音學、用於有伴奏歌唱音訊分析之概似模型 | zh_TW |
dc.title | Acoustic-Phonetic Likelihood Models for Analysis of Accompanied Singing Audio | en |
dc.type | Thesis | |
dc.date.schoolyear | 104-1 | |
dc.description.degree | 博士 (Doctoral) | |
dc.contributor.oralexamcommittee | 張智星,古鴻炎,劉奕汶 | |
dc.subject.keyword | 旋律抽取,歌詞對齊,歌聲,聲學語音學,基頻修改,聲帶波形,共振峰頻率, | zh_TW |
dc.subject.keyword | melody extraction,lyrics alignment,singing voice,acoustic phonetics,F0 modification,glottal pulse shape,formant frequency, | en |
dc.relation.page | 73 | |
dc.rights.note | Authorization granted (open access worldwide) | |
dc.date.accepted | 2016-01-29 | |
dc.contributor.author-college | 電機資訊學院 | zh_TW |
dc.contributor.author-dept | 電信工程學研究所 | zh_TW |
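As an illustrative aside on the method summarized in the abstract above: the dissertation combines a frame-wise likelihood model (timbral and voicing fitness per F0 candidate) with a prior sequential model to choose the final F0 estimate. A generic way to realize that combination is Viterbi decoding over candidate sequences. The sketch below is not the dissertation's implementation; the function name, array shapes, and toy inputs are assumptions made only to show the likelihood-plus-sequential-prior idea.

```python
import numpy as np

def viterbi_f0_track(log_likelihood, log_transition, log_prior):
    """Most probable candidate sequence under a frame-wise likelihood
    and a first-order sequential prior (illustrative sketch).

    log_likelihood: (T, K) log-likelihood of K F0 candidates per frame
    log_transition: (K, K) log-probability of moving between candidates
    log_prior:      (K,)   log-probability of the initial candidate
    """
    T, K = log_likelihood.shape
    # best log-score of any path ending in each candidate at frame 0
    delta = log_prior + log_likelihood[0]
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best path ending in i at t-1, then moving to j
        scores = delta[:, None] + log_transition
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(K)] + log_likelihood[t]
    # trace the best path backwards from the final frame
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]
```

With a uniform prior and uniform transitions, the decoder reduces to per-frame argmax of the likelihood; a peaked transition matrix instead favors smooth F0 tracks, which is the role the sequential model plays in the abstract's description.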
Appears in Collections: | 電信工程學研究所 (Graduate Institute of Communication Engineering) |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-105-1.pdf | 1.45 MB | Adobe PDF | View/Open |
All items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.