Multi-Task Molecular Design Using Conditional Variational Autoencoder Based Transformer

孫肇廷; Chao-Ting Sun

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90494

Title:	Multi-Task Molecular Design Using Conditional Variational Autoencoder Based Transformer (運用條件式變分自編碼器進行多任務分子設計)
Author:	孫肇廷 (Chao-Ting Sun)
Advisor:	林祥泰 (Shiang-Tai Lin)
Keywords:	machine learning, deep generative model, conditional variational autoencoder, Transformer, multi-task molecular design
Year of publication:	2023
Degree:	Master's
Abstract:	Deep generative models in machine learning have the advantages of being fast and not limited by theoretical models. With advances in computing power, model architectures, optimization algorithms, and large open-source molecular databases, these models are now widely used across molecular design tasks. This study applies a single deep generative model to several molecular design tasks, with the goal of rapidly and precisely generating molecules that satisfy target conditions. We take the 1.58 million neutral molecules provided by the MOSES benchmark platform, represent them as SMILES, and train a Transformer model based on a conditional variational autoencoder (CVAE). The model can generate SMILES unconditionally, under property conditions, under a structural condition, and under a combination of property and structural conditions. The property conditions are the partition coefficient, topological polar surface area, and quantitative estimate of drug-likeness; the structural condition is the Bemis-Murcko scaffold. After training, the models responsible for the different tasks each generate a large number of SMILES. We use validity to check the models' grasp of SMILES syntax and chemical structure rules, novelty to assess their ability to discover new molecules, and uniqueness and internal diversity to assess their ability to generate distinct and diverse molecules. Comparison with the training data verifies whether the training data distribution has been learned accurately. For property-conditioned models, we compute the systematic and absolute errors of the generated molecules. For structure-conditioned models, we measure the percentage of generated molecules that satisfy the structural condition. The results show that, without any constraints, the model generates highly valid, novel, unique, and diverse SMILES whose distributions of 12 molecular descriptors nearly match those of the training data. Under property constraints, close to half of the generated SMILES correspond to molecules that satisfy the property conditions, and the model can discover novel molecular structures in regions with no training data. Under a structural constraint, nearly all generated SMILES satisfy the condition, whether or not the structure was seen during training, and even simple carbon chains, which are not Bemis-Murcko scaffolds, can serve as the structural condition. Under combined structural and property constraints, about 10% of the generated SMILES satisfy the conditions. However, when the structural condition is small and there is ample training data near the property condition, about half of the SMILES generated for unseen structural conditions still satisfy the conditions. Moreover, because a variational autoencoder creates a dense and smooth latent space for representing molecules, interpolating in this latent space can produce molecules structurally similar to two given molecules. We perform many molecular interpolations, calculate the structural similarity of the generated molecules, and observe that the structures change smoothly. We quantify this smoothness and demonstrate the superiority of variational autoencoders over standard autoencoders for molecular interpolation.
URI:	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/90494
DOI:	10.6342/NTU202302721
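Full-text authorization:	Authorized (campus access only)
Electronic full-text release date:	2025-12-31
Appears in collections:	Department of Chemical Engineering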
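
The evaluation quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation, and the full text is not reproduced in this record, so the definitions below follow common usage and are assumptions: validity is the valid fraction of generated strings, uniqueness is the distinct fraction of valid strings, novelty is the fraction of distinct valid strings absent from the training set, internal diversity is one minus the mean pairwise Tanimoto similarity, "systematic error" is read as the mean signed error, and interpolation is taken to be linear in latent space. All function names, and the `is_valid` callback that stands in for a real SMILES parser, are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Set, Tuple


def generation_metrics(generated: Sequence[str], training: Sequence[str],
                       is_valid: Callable[[str], bool]) -> Dict[str, float]:
    """Validity, uniqueness and novelty of generated SMILES strings.

    Assumes strings are already canonicalized, so equal molecules compare equal.
    `is_valid` is a caller-supplied check (hypothetical stand-in for a parser).
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def internal_diversity(fingerprints: Sequence[Set[int]]) -> float:
    """One minus the mean Tanimoto similarity over all distinct pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def property_errors(targets: Sequence[float],
                    achieved: Sequence[float]) -> Tuple[float, float]:
    """Return (mean signed error, mean absolute error) of achieved vs. target.

    Reading "systematic error" as the mean signed error is an assumption.
    """
    diffs = [a - t for a, t in zip(achieved, targets)]
    if not diffs:
        raise ValueError("need at least one target/achieved pair")
    n = len(diffs)
    return sum(diffs) / n, sum(abs(d) for d in diffs) / n


def interpolate(z_a: Sequence[float], z_b: Sequence[float],
                steps: int) -> List[List[float]]:
    """Evenly spaced points on the straight line from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [[(1 - t) * a + t * b for a, b in zip(z_a, z_b)]
            for t in (i / (steps - 1) for i in range(steps))]
```

In practice each interpolated latent vector would be passed to the trained decoder, and validity and fingerprints would come from a cheminformatics toolkit rather than from the placeholder inputs used here.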

Files in this item:

File	Size	Format
ntu-111-2.pdf (access restricted to NTU campus IP addresses; off-campus users should connect through the VPN service)	13.91 MB	Adobe PDF


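Unless their copyright terms are specifically indicated otherwise, all items in this system are protected by copyright, with all rights reserved.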
