New Onset and Pitch Detection Algorithms in Music Signal Processing

Ta Hsien; 冼達

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47921

Title:	New Onset and Pitch Detection Algorithms in Music Signal Processing (新式發端與音階識別演算法於音樂信號處理)
Author:	Ta Hsien (冼達)
Advisor:	Jian-Jiun Ding (丁建均)
Keywords:	Music information retrieval, voice activity detection, onset detection, MIDI, MFCC (Mel-frequency cepstral coefficients), LPC (linear prediction coefficients), time-frequency analysis (STFT), pitch estimation, HHT (Hilbert-Huang transform), SHS (sub-harmonic summation), HFC, Surf
Year of publication:	2011
Degree:	Master's
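Abstract:	Singing-voice retrieval systems are widely used today. Analyzing and decomposing the features of a 1-D sound signal with various transform theories is a key technique for sound analysis, synthesis, and compression. Applied to music signals, it can analyze and classify types of sound. Applied to hummed voice signals, it can find similar songs and so support retrieval. Applied to speech signals, it can recognize what the speaker says and map speech to text for translation, serving multi-level speech retrieval applications. Systems can therefore be designed in many ways, depending on the purpose and use of the signal processing. A singing-voice retrieval system mainly enhances and restores the voice signal using various mathematical theories, segments and classifies it, extracts the features of each segment, and uses the relationships among the feature distributions as the code for a piece of audio. Matching and ranking algorithms such as dynamic programming or Markov models then search the score codes in the database for the hummed signal and output a ranked list. The improvements we propose are summarized below.

(Filter design) For hummed voice input, this thesis designs a new algorithmic framework that achieves faster and more accurate feature analysis than previous algorithms. Current voice signal analysis typically uses denoising algorithms such as the Wiener filter in preprocessing to restore the signal or enhance its features, then uses a feature extraction algorithm to cut the signal into units of different lengths. In line with human auditory perception, the unit is usually the "syllable". Features such as "timbre" or "fundamental frequency" models are then extracted from the segmented signal as preprocessing for the subsequent ranking algorithm. Many papers on singing-voice retrieval discuss the enhancement of hummed voice and feature sampling. For noise removal and signal enhancement, the communications field offers many advanced algorithms, for example analysis through mathematical models of noise and signal. For hummed voice as perceived by the human ear, however, this thesis maps the humming voice and the noise to a new mathematical model, proposes new theory and an implementation for restoring and enhancing the signal, and provides thorough experimental data.

(Onset detection) This is the second improvement. For segmenting the singing voice, the literature discusses many criteria based on changes in the phase, frequency, or volume of each sound segment. According to the implementation results of many papers, segmentation based on changes in frequency features or phase tends to over-segment under high-frequency noise and is costly in time and resources, while segmentation based on volume changes is strongly affected by background noise or by voice signals without clear breaks between phrases. Based on a mathematical model of the singing voice, this thesis combines frequency, volume, and other features in one segmentation algorithm that greatly improves segmentation accuracy and agrees better with human auditory perception. Its complexity is also lower than that of earlier segmentation based solely on frequency-change features, and comparative test data are reported.

(Pitch estimation) This is the third improvement. The fundamental frequency of a unit voice segment plays a very important role: it serves as a voice feature and for speaker recognition, and it is also key to assisting segmentation and pitch tracking. Speed and accuracy are often hard to achieve together. Our improved method remains very accurate under heavy background noise, and thorough tests and comparisons are provided.

(Adaptive MIDI number) This is the fourth improvement. The features of a unit voice signal are usually represented by its fundamental frequency. The MIDI number, used in most of the literature, represents notes by dividing each octave into 12 parts: the fundamental frequency of each unit interval is extracted and then mapped to a MIDI symbol for comparison. According to experimental results, fundamental-frequency extraction algorithms are often not accurate enough. They are affected by timbre distribution, background noise, and similar factors, and reach only 75% on average. Moreover, MIDI symbols were originally designed to calibrate pitch between different instruments and machines, so using them to compare pitch differences as perceived by the human ear often deviates from what is expected and causes errors in the downstream music retrieval algorithm. This thesis provides a new fundamental-frequency detection algorithm that raises the average accuracy to 95%, together with a new adaptive theory that corrects the MIDI symbols and a pitch-difference evaluation algorithm based on a model of human hearing. This removes the problem of using plain MIDI-symbol differences as the feature between signals, so that even individuals with poor intonation can build their own adaptive MIDI symbols.

We provide several substantial corrections to current segmentation and feature algorithms for hummed voice signals. According to simulation results, they greatly improve accuracy, reduce complexity, and satisfy optimization conditions such as system stability, so the techniques in this thesis can be applied not only to hummed voice signals but also, in the future, to speech synthesis, coding, and speech retrieval with better experimental results.

There are many applications of the query-by-humming (QBH) system. It combines feature selection, MIDI number analysis, and melody matching for the 1-D voice signal.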
The core techniques include signal transform theory, feature analysis, and segmentation of the voice signal, which help us understand and classify the voice signal for further applications. Applying these analysis techniques in the QBH system allows similar songs in the database to be retrieved. Related applications such as speech-to-text, speech translation, and multilingual transcription can also be supported after speech recognition. The QBH system can be divided into several stages. First, it emphasizes the features in the spectrum and removes irrelevant noise. The onsets are obtained by classifying segments with different pitch features. The pitches are then converted into MIDI numbers, forming a code sequence. The output of the QBH system is obtained by comparing the pitches of the humming signal with those of the songs in the database. This step is called melody matching, and it uses dynamic programming, hidden Markov models, and similar methods for ranking and similarity measurement. The improvements we propose are described below.

(Filter design) Focusing on humming signal restoration, we propose a new adaptive algorithm for filter design. It offers high analysis efficiency, a high SNR, and a small MSE with reliable stability. Compared with conventional signal restoration algorithms such as the Wiener filter and the Butterworth filter, it improves the SNR and reduces the reconstruction error. Much research in telecommunication engineering focuses on signal and noise analysis, transformation, and feature extraction of the voice signal. The Fourier transform (FT) maps a signal into the frequency domain, where noise can be removed. Drawing on psychoacoustics, however, we propose a new mathematical model for representing the humming voice and use it for signal restoration.
After implementing our algorithms, we present a variety of simulation results and compare the performance with existing filters in Chapter 4.

(Onset detection) The second improvement concerns the onset detection architecture. Segmentation based on "amplitude", "frequency", and "phase" has been proposed in many papers. According to the implementations and comparisons reported there, frequency- and phase-based methods tend toward over-detection because of the noise characteristics, while amplitude-based methods may under-detect because of background noise and successive sounds that run together. To overcome these problems, we propose a new onset detection architecture that combines features from both the spectral and the time domains. It improves accuracy so that the results better match human perception, and it is less complex to implement. Complete test results and comparisons are given in Chapter 4.

(Pitch estimation) The third improvement concerns instantaneous frequency detection. Pitch extraction is very important for the entire system: the pitch feature can be used for speaker identification, classification, onset detection, and voice tracking. In Chapter 6, we therefore propose an improvement based on sub-harmonic summation that achieves high accuracy under noise interference.

(Adaptive MIDI) The fourth improvement concerns adaptive pitch representation. The most common pitch representation is the MIDI number, which divides each octave into 12 notes so that the instantaneous frequency can easily be mapped to its corresponding number. However, standard MIDI numbers were designed for communication among different musical instruments, and there is a gap between standard MIDI numbers and auditory perception.
The new pitch estimation method reaches an accuracy of 95%, and the adaptive MIDI numbers adjust the measurement to each individual, constructing an adaptive MIDI mapping that compensates for off-key singing. We also improved the entire onset detection system so that the correct pitches are retrieved. Tests covering a variety of aspects show that the proposed method achieves high onset detection accuracy, lower complexity, and high stability. The proposed algorithm can therefore improve QBH systems, voice signal analysis systems, and music signal coding systems, and it may further improve speech recognition systems in the future.
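
The abstract describes mapping a detected frequency to a MIDI number (12 notes per octave) and then matching the resulting note sequence against a database with dynamic programming. The sketch below illustrates only those two standard building blocks, not the thesis's adaptive MIDI mapping or its matching algorithm. It assumes the usual convention that A4 = 440 Hz is MIDI note 69 and uses plain dynamic time warping as a stand-in matcher. The function names, the hummed frequencies, and the two-song database are hypothetical.

```python
import math

A4_HZ = 440.0    # assumed reference tuning (standard MIDI convention)
A4_MIDI = 69     # MIDI note number of A4 under that convention


def hz_to_midi(freq_hz):
    """Return the fractional MIDI number of a frequency (12 semitones per octave)."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)


def dtw_distance(a, b):
    """Dynamic-time-warping cost between two note sequences (absolute semitone difference)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Hypothetical hummed pitch track in Hz (roughly C4, D4, E4, C4).
hummed_hz = [261.63, 293.66, 329.63, 261.63]
hummed = [round(hz_to_midi(f)) for f in hummed_hz]

# Hypothetical database of melodies already stored as MIDI numbers.
database = {"song_a": [60, 62, 64, 60], "song_b": [60, 60, 67, 67]}

# Rank the songs from most to least similar to the hummed query.
ranking = sorted(database, key=lambda name: dtw_distance(hummed, database[name]))
print(hummed, ranking)
```

Rounding to the nearest integer is where an off-key singer would be penalized, which is the gap the adaptive MIDI numbers in the abstract are meant to address. A real system would also need to handle transposition and tempo differences, which this sketch ignores.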
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/47921
Full-text license:	Paid license
Appears in collections:	Graduate Institute of Communication Engineering

Files in this item:

File	Size	Format
ntu-100-1.pdf (not currently authorized for public access)	9.79 MB	Adobe PDF


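Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are specifically indicated otherwise.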
