Computer-Aided Drug Design for Molecular Property Prediction

Jen-Hao Chen; 陳人豪

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85760

Title:	Computer-Aided Drug Design for Molecular Property Prediction
Author:	Jen-Hao Chen (陳人豪)
Advisor:	Yufeng Jane Tseng (曾宇鳳)
Keywords:	Biological sciences, Drug discovery, Cheminformatics
Publication year:	2022
Degree:	Doctoral
Abstract:	Fingerprint-based, feature-based, and molecular-graph-based representations have all been used with different deep learning methods to predict molecular properties. The choice of molecular representation has been clearly shown to affect both model prediction and explainability. We review different representations and focus on graph and line notations for modelling. In general, one canonical chemical structure is used to represent one molecule when computing its properties. We carefully examine the commonly used simplified molecular-input line-entry system (SMILES) notation for a single molecule and propose using the full enumeration of its SMILES strings to achieve better accuracy. A convolutional neural network (CNN) is used as the model. Full SMILES enumeration improves the representation of a molecule by describing it from all possible angles. The resulting CNN model can be very robust on large datasets, since no additional explicit chemistry knowledge is needed to predict solubility.
Traditionally, it has also been hard to use a neural network to explain the contribution of chemical substructures to a single property. We demonstrate the use of attention in the decoding network to detect the parts of a molecule that are relevant to solubility, which can be used to explain how chemical structure contributes to the CNN's predictions. The key to generating the best deep learning model for predicting molecular properties is to test and apply various optimization methods. While individual optimization methods from past work outside the pharmaceutical domain have each improved model performance, greater improvement may be achieved when specific combinations of these methods and practices are applied carefully. We use and discuss three high-performance optimization methods from the literature that have been shown to dramatically improve model performance in other fields, eventually arriving at a general procedure for generating optimized CNN models for different molecular properties. The three techniques are a dynamic batch size strategy for different enumeration ratios of the SMILES representation of compounds, Bayesian optimization for selecting model hyperparameters, and feature learning in which chemical features processed by a feedforward neural network are concatenated with the molecular feature vector learned by the CNN. A total of seven different molecular properties (water solubility, lipophilicity, hydration energy, electronic properties, blood–brain barrier permeability and inhibition) are used. We demonstrate how each of the three techniques affects the model and how the best model generally benefits from Bayesian optimization combined with dynamic batch size tuning.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85760
DOI:	10.6342/NTU202200783
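Full-text authorization:	Granted (open access worldwide)
Full-text release date:	2022-07-05
Appears in department:	Department of Computer Science and Information Engineering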
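
The abstract's central idea, representing one molecule by many equivalent SMILES strings rather than a single canonical one, can be illustrated with a small sketch. This is not the thesis's implementation: the function name `enumerate_smiles` and the atom/bond input format are hypothetical, and the sketch only handles acyclic molecules with single bonds and implicit hydrogens. A real pipeline would more likely rely on a cheminformatics toolkit to generate the variants.

```python
from itertools import permutations


def enumerate_smiles(atoms, bonds):
    """List the SMILES strings obtained by varying start atom and branch order.

    Assumptions of this sketch: the molecule is acyclic, all bonds are single,
    and hydrogens are implicit.
    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs forming a tree
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def write(atom, parent):
        # Collect every string for the subtree at `atom`, entered from `parent`.
        children = [n for n in neighbors[atom] if n != parent]
        if not children:
            return {atoms[atom]}
        results = set()
        for order in permutations(children):
            partial = {atoms[atom]}
            for k, child in enumerate(order):
                subs = write(child, atom)
                if k < len(order) - 1:
                    # Every child except the last becomes a parenthesized branch.
                    partial = {p + "(" + s + ")" for p in partial for s in subs}
                else:
                    # The last child continues the main chain.
                    partial = {p + s for p in partial for s in subs}
            results |= partial
        return results

    all_smiles = set()
    for start in range(len(atoms)):
        all_smiles |= write(start, None)
    return sorted(all_smiles)


# Ethanol: two carbons and one oxygen in a chain.
ethanol_variants = enumerate_smiles(["C", "C", "O"], [(0, 1), (1, 2)])
print(ethanol_variants)
```

Each returned string denotes the same molecule, so all variants would share one property label when used as training inputs. The number of variants grows quickly with molecule size and branching, which is presumably why the abstract speaks of enumeration ratios and pairs them with batch size tuning.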

Files in this item:

File	Size	Format
U0001-1905202217054200.pdf	2.61 MB	Adobe PDF
