Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72725

Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | 李琳山 | |
| dc.contributor.author | Gene-Ping Yang | en |
| dc.contributor.author | 楊靖平 | zh_TW |
| dc.date.accessioned | 2021-06-17T07:04:34Z | - |
| dc.date.available | 2019-08-13 | |
| dc.date.copyright | 2019-08-13 | |
| dc.date.issued | 2019 | |
| dc.date.submitted | 2019-07-27 | |
| dc.identifier.uri | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/72725 | - |
| dc.description.abstract | 本論文之主軸在探討語者無關的語音分離(Speaker Independent Speech Separation)技術,亦即在沒有語者資訊的情況下要把兩個以上的語者混雜語音分離出來。這在許多語音處理系統中都很有用,包含語音辨識、語者識別等等。當音訊中出現兩個語者以上的語音時,我們的目標就是將這些具備相近特性的語音分離出來。目前以深層學習方法處理這個問題主要分為兩大主流:頻域方法以及時域方法。兩者最大的不同在於模型的輸入,一個輸入的是原始時域訊號,另一個的輸入為經短時傅立葉轉換後所得的時頻譜。這兩種方法也分別使用了不同的模型架構,以處理這兩種不同的輸入,然而這些方法都各有缺點。
本論文提出基於時頻跨域共同嵌入及聚類之分離技術,可以讓兩種不同領域的輸入訊號(時域和頻域)能夠互相參考。我們主要基於卷積式類神經網路建模,而本論文所提出的方法是截至目前為止語者無關的語音分離技術中表現最好的演算法之一。本文將主要分析不同類神經網路模組對此問題的影響,並透過實驗數據分析各模組在解決語者無關語音分離問題時的優缺點。 | zh_TW |
| dc.description.abstract | The main topic of this thesis is speaker-independent speech separation, i.e., separating the speech of two or more speakers from a mixture without any prior speaker information. This is useful in many speech processing systems, including speech recognition, speaker recognition, and others: when two or more speakers are present in an audio signal, the goal is to separate these voices with similar characteristics. Current deep learning approaches to this problem fall into two mainstreams: frequency-domain methods and time-domain methods. The biggest difference between the two is the model input; one takes the raw time-domain waveform, while the other takes the spectrogram obtained by the short-time Fourier transform. The two approaches also use different model architectures to handle these two kinds of input, but each has its own drawbacks.
This thesis proposes a separation technique based on time-and-frequency cross-domain joint embedding and clustering, which allows input signals from the two domains (time and frequency) to reference each other. Our model is built mainly on convolutional neural networks, and the proposed method is among the best-performing speaker-independent speech separation algorithms to date. We analyze the influence of different neural network modules on this problem and, through experimental results, examine the strengths and weaknesses of each module for speaker-independent speech separation (a minimal illustrative sketch of the cross-domain idea follows the metadata table below). | en |
| dc.description.provenance | Made available in DSpace on 2021-06-17T07:04:34Z (GMT). No. of bitstreams: 1 ntu-108-R06944010-1.pdf: 8492788 bytes, checksum: d0e42c759e6c19cf011f94fed993fb75 (MD5) Previous issue date: 2019 | en |
| dc.description.tableofcontents | Oral Defense Committee Certification i
Acknowledgements ii
Chinese Abstract v
Chapter 1: Introduction 1
1.1 Research Motivation 1
1.2 Research Directions 3
1.3 Main Contributions 4
1.4 Thesis Organization 4
Chapter 2: Background 6
2.1 Neural Networks 6
2.1.1 Overview 6
2.1.2 Training 8
2.1.3 Recurrent Neural Networks 10
2.1.4 Convolutional Neural Networks 12
2.1.5 Dilated Convolution 14
2.1.6 Depthwise Separable Convolution 15
2.2 K-Means Clustering Algorithm 19
2.3 Speech Separation 21
2.3.1 Overview 21
2.3.2 Permutation Invariant Training 22
2.3.3 Frequency-Domain Speech Separation Algorithms 23
2.3.4 Time-Domain Speech Separation Algorithms 26
2.4 Chapter Summary 27
Chapter 3: Cross-Domain Joint Separation Algorithm 29
3.1 Overview 29
3.2 Model Architecture 32
3.2.1 Encoder 32
3.2.2 Separator 34
3.2.3 Decoder 42
3.3 Chapter Summary 42
Chapter 4: Experimental Design and Discussion of Results 45
4.1 Overview 45
4.2 Datasets 45
4.3 Experimental Setup 46
4.4 Evaluation Metrics 47
4.5 Experimental Results and Analysis (1) 48
4.5.1 Importance of Time-Domain and Frequency-Domain Features 49
4.5.2 Effect of Cross-Domain Features on the Model 52
4.5.3 Effect of High-Dimensional Embedding and Clustering on Separation Performance 54
4.5.4 Analysis of Models Combining Time-Frequency Cross-Domain Features with High-Dimensional Embedding 55
4.5.5 Comparison with Other State-of-the-Art Results 58
4.6 Experimental Results and Analysis (2) 63
4.6.1 Training with a Randomly Fixed Permutation Order 64
4.6.2 Using a Pretrained Permutation Order 65
4.7 Chapter Summary 66
Chapter 5: Conclusion and Future Work 69
5.1 Contributions and Discussion 69
5.2 Future Work 70
References 71 | |
| dc.language.iso | zh-TW | |
| dc.subject | 雞尾酒問題 | zh_TW |
| dc.subject | 語音分離 | zh_TW |
| dc.subject | 深度聚類 | zh_TW |
| dc.subject | Speech separation | en |
| dc.subject | Cocktail party problem | en |
| dc.subject | Deep clustering | en |
| dc.title | 基於時頻跨域共同嵌入及聚類之語音分離 | zh_TW |
| dc.title | Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering | en |
| dc.type | Thesis | |
| dc.date.schoolyear | 107-2 | |
| dc.description.degree | 碩士 | |
| dc.contributor.oralexamcommittee | 陳信宏,鄭秋豫,王小川,李宏毅 | |
| dc.subject.keyword | 語音分離,雞尾酒問題,深度聚類 | zh_TW |
| dc.subject.keyword | Speech separation, Cocktail party problem, Deep clustering | en |
| dc.relation.page | 76 | |
| dc.identifier.doi | 10.6342/NTU201901849 | |
| dc.rights.note | 有償授權 | |
| dc.date.accepted | 2019-07-29 | |
| dc.contributor.author-college | 電機資訊學院 | zh_TW |
| dc.contributor.author-dept | 資訊網路與多媒體研究所 | zh_TW |
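
The abstract above contrasts raw-waveform (time-domain) and STFT-spectrogram (frequency-domain) inputs, and proposes embedding both in a shared space that is then clustered to separate speakers. Below is a minimal PyTorch sketch of that general idea only, not the thesis's actual architecture: the names `CrossDomainSeparator` and `kmeans_masks`, all layer sizes, and the concatenation-based fusion are illustrative assumptions, and the decoder that would resynthesize waveforms from the masked features is omitted.

```python
# NOTE: hypothetical sketch only; names, sizes, and the fusion scheme are
# illustrative assumptions, not the architecture from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossDomainSeparator(nn.Module):
    """Embed time-domain and frequency-domain features in one joint space."""

    def __init__(self, n_filters=256, win=256, hop=128, emb_dim=20):
        super().__init__()
        self.win, self.hop = win, hop
        self.n_filters, self.emb_dim = n_filters, emb_dim
        # Time-domain branch: learned 1-D convolutional encoder on the waveform.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size=win, stride=hop)
        # Frequency-domain branch: STFT magnitude projected to the same width.
        self.freq_proj = nn.Linear(win // 2 + 1, n_filters)
        # Joint embedding: an emb_dim vector per (frame, encoder channel) pair.
        self.embed = nn.Linear(2 * n_filters, n_filters * emb_dim)

    def forward(self, wav):                                   # wav: (B, samples)
        t_feat = torch.relu(self.encoder(wav.unsqueeze(1)))   # (B, C, T)
        spec = torch.stft(wav, n_fft=self.win, hop_length=self.hop,
                          window=torch.hann_window(self.win),
                          center=False, return_complex=True)  # (B, win/2+1, T')
        f_feat = self.freq_proj(spec.abs().transpose(1, 2))   # (B, T', C)
        T = min(t_feat.shape[-1], f_feat.shape[1])            # align frame counts
        fused = torch.cat([t_feat[..., :T].transpose(1, 2),
                           f_feat[:, :T]], dim=-1)            # (B, T, 2C)
        emb = self.embed(fused).view(-1, T, self.n_filters, self.emb_dim)
        return t_feat[..., :T], F.normalize(emb, dim=-1)      # unit-norm emb

def kmeans_masks(emb, n_src=2, iters=20):
    """Plain K-means on (T, C, D) embeddings; returns one (T, C) mask per source."""
    T, C, D = emb.shape
    x = emb.reshape(-1, D)                                    # (T*C, D) points
    centers = x[torch.randperm(x.shape[0])[:n_src]].clone()   # random init
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=1)        # nearest center
        for k in range(n_src):
            if (assign == k).any():                           # skip empty cluster
                centers[k] = x[assign == k].mean(dim=0)
    return [(assign == k).float().view(T, C) for k in range(n_src)]

# Usage on a dummy 1-second, 16 kHz two-speaker mixture:
model = CrossDomainSeparator()
mixture = torch.randn(1, 16000)
feat, emb = model(mixture)                    # feat: (1, 256, T); emb: (1, T, 256, 20)
masks = kmeans_masks(emb[0].detach())         # two binary (T, 256) masks
separated = [feat[0] * m.t() for m in masks]  # masked encoder features per speaker
```

For context, in the deep clustering literature the embeddings are trained with an affinity-based objective so that units belonging to the same speaker cluster together, and K-means is applied only at inference; a decoder (e.g., a transposed convolution) would then map each masked representation back to a waveform. Both steps are omitted above for brevity.
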
| Appears in Collections: | 資訊網路與多媒體研究所 |
Files in This Item:
| File | Size | Format |
|---|---|---|
| ntu-108-1.pdf (Restricted Access) | 8.29 MB | Adobe PDF |
All items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.