Please use this Handle URI to cite this document:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93439
Title: | 以機器學習因果推論挑選社會因子進行XGBoost房價預測 Selecting Social Factors for XGBoost House Price Prediction Using Machine Learning Causal Inference |
Author: | 林宗昇 Tsung-Sheng Lin |
Advisor: | 張瑞益 Ray-I Chang |
Keywords: | Social Factor, XGBoost, Bayesian Network, Causality Inference, House Price Prediction |
Year of publication: | 2024 |
Degree: | Master's |
Abstract: | House price prediction in Taiwan has faced several problems in feature selection. First, studies based on the Actual Price Registration dataset usually take all administrative districts of a city or county as a single scope, which ignores the differences between districts and introduces prediction error. Second, social factor features are mostly limited to macroeconomic variables, which are not intuitive to the general public. Third, selecting features by rules of thumb tends to make the chosen features less objective.
This study therefore predicts at the administrative-district level, both to test whether prediction accuracy improves and to examine whether the main factors influencing house prices differ across districts. It focuses on emerging development areas and suburban areas farther from city centers. In addition to the commonly used macroeconomic variables, the social factor datasets cover government policies, corporate and developer investment, transportation facilities, population structure, and the COVID-19 pandemic. These social factors are classified using the CPSS (Cyber-Physical-Social System) ontology framework.
For feature selection, Bayesian networks and causal inference are used to rank the social factor features by their influence on the target variable. The factors are then added to the XGBoost (Extreme Gradient Boosting) model in that order to obtain prediction results, and the appropriate number of features under this research framework is found to be 10. Experiments show that, compared with not using social factors, the proposed approach improves MAPE (Mean Absolute Percentage Error) by 5%, RMSE (Root Mean Square Error) by 11%, and R2_SCORE (coefficient of determination, R^2) by 4%. Compared with selecting social factors by rules of thumb, it improves MAPE by 1.34%, RMSE by 3.44%, and R2_SCORE by 1.31%. In addition, predicting by district rather than over the whole area improves MAPE by 2%. The results indicate that the proposed factor selection method is better grounded and more interpretable than rule-of-thumb selection and can improve prediction accuracy. The study also offers a new line of research for house price prediction and can help investors, developers, and policymakers understand and predict market dynamics more accurately. |
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/93439 |
DOI: | 10.6342/NTU202402749 |
Full-text authorization: | Not authorized |
Appears in collections: | Department of Engineering Science and Ocean Engineering |
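
The evaluation loop described in the abstract (add the ranked social factors one at a time, score each feature set, keep the best count) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, the metrics are the standard textbook definitions of MAPE, RMSE, and R^2, and `evaluate` is a placeholder callable standing in for "train a model such as XGBoost on these features and return its validation error". The causal ranking itself is assumed to be given.

```python
import math
from typing import Callable, Sequence, Tuple


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error, in the units of the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def best_prefix(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[int, float]:
    """Add features in ranked order and return (count, error) of the best prefix.

    `ranked_features` is assumed to be sorted from most to least influential.
    `evaluate` is a hypothetical callable that returns an error to minimize
    (for example, validation MAPE) for a given list of features.
    """
    best_k, best_err = 0, float("inf")
    for k in range(1, len(ranked_features) + 1):
        err = evaluate(ranked_features[:k])
        if err < best_err:  # strict comparison keeps the smaller set on ties
            best_k, best_err = k, err
    return best_k, best_err
```

In practice, `evaluate` would fit the regressor on the training split restricted to the given features and return an error computed with one of the metric functions above on a held-out split.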
Files in this item:
File | Size | Format | |
---|---|---|---|
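ntu-112-2.pdf (currently not authorized for public access) | 2.46 MB | Adobe PDF |
Unless their copyright terms are specifically indicated, all items in this system are protected by copyright, with all rights reserved.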