Skip navigation

DSpace

機構典藏 DSpace 系統致力於保存各式數位資料(如:文字、圖片、PDF)並使其易於取用。

點此認識 DSpace
DSpace logo
English
中文
  • 瀏覽論文
    • 校院系所
    • 出版年
    • 作者
    • 標題
    • 關鍵字
    • 指導教授
  • 搜尋 TDR
  • 授權 Q&A
    • 我的頁面
    • 接受 E-mail 通知
    • 編輯個人資料
  1. NTU Theses and Dissertations Repository
  2. 電機資訊學院
  3. 電信工程學研究所
請用此 Handle URI 來引用此文件: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91391
完整後設資料紀錄
DC 欄位值語言
dc.contributor.advisor黃寶儀zh_TW
dc.contributor.advisorPolly Huangen
dc.contributor.author許誠zh_TW
dc.contributor.authorCheng Hsuen
dc.date.accessioned2024-01-26T16:17:53Z-
dc.date.available2024-01-27-
dc.date.copyright2024-01-26-
dc.date.issued2024-
dc.date.submitted2024-01-12-
dc.identifier.citation[1] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.
[2] K. P. Burnham. Design and analysis methods for fish survival experiments based on release-recapture / Kenneth P. Burnham ... [et al.]. American Fisheries Society Bethesda, Md, 1987.
[3] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
[4] R. M. CORMACK. Estimates of survival from the sighting of marked animals. Biometrika, 51(3-4):429–438, 12 1964.
[5] A. W. F. Edwards. Likelihood. Cambridge University Press, 1972.
[6] D. et al. Internet scale user-generated live video streaming: The twitch case. In Proceedings of PAM, pages 60–71, New York, NY, 02 2017. Springer.
[7] M. Halkidi and M. Vazirgiannis. Clustering validity assessment: finding the optimal partitioning of a data set. In Proceedings 2001 IEEE International Conference on Data Mining, pages 187–194, 2001.
[8] G. M. Jolly. Explicit estimates from capture-recapture data with both death and immigration-stochastic model. Biometrika, 52(1/2):225–247, 1965.
[9] C. Krebs. Ecological Methodology, 3rd ed. 2014. https://www.zoology.ubc.ca/~krebs/books.html(visited 2021-05-25).
[10] F. Lincoln. Calculating Waterfowl Abundance on the Basis of Banding Returns. Cir cular (United States. Department of Agriculture). U.S. Department of Agriculture, 1930.
[11] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu. Understanding of internal clustering validation measures. In 2010 IEEE international conference on data mining, pages 911–916. IEEE, 2010.
[12] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the theory of statistics. McGraw-Hill New York, 3rd ed. edition, 1974.
[13] N. I. of Standards, Technology, and I. SEMATECH. NIST/SEMATECH e-Handbook of Statistical Methods.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[15] C. G. J. PETERSEN. The yearly immigration of young plaice into the limfjord from the german sea, ect. Report of the Danish Biological Station for 1895, 6:1–48, 1896
[16] S. Pledger, K. H. Pollock, and J. L. Norris. Open capture-recapture models with heterogeneity: I. cormack-jolly-seber model. Biometrics, 59(4):786–794, 2003.
[17] B. L. S. Prakasa Rao. Maximum likelihood estimation for markov processes. Annals of the Institute of Statistical Mathematics, 24(1):333–345, Dec 1972.
[18] G. A. F. Seber. A note on the multiple-recapture census. Biometrika, 52(1-2):249–260, 06 1965.
[19] TwitchTracker. Twitch statistics and charts, 2021.
[20] C. Wang”. ”discovering twitch’s video delivery infrastructure utilizing cloudservices and vpns”.
[21] H. C. Y.-T. L. C. H. G.-T. T. P. H. Wei-Shiang Wung, Caleb Wang. Poster: 'Twitch's cdn as an open-population ecosystem, 08 2020.
[22] W.-S. Wung”. ”estimation on server population with mle-based cmr (capture-mark recapture) model”
-
dc.identifier.urihttp://tdr.lib.ntu.edu.tw/jspui/handle/123456789/91391-
dc.description.abstract近年來,串流媒體越來越熱門。其中,Twitch 主宰了遊戲直播的市場。在 2021 年,Twitch 擁有平均兩百七十多萬的線上即時觀眾與超過十萬名在線直播主。如同其他的串流媒體,Twitch 使用內容傳遞網路(Content Delivery Network)來提供服務給來自世界各地的廣大觀眾。內容傳遞網路可以降低內容傳播的延遲時間,是影響觀眾收看品質的關鍵之一。
在 2017 發表的一篇論文中,有團隊仔細分析過 Twitch 內容傳遞網路的架構。然而,這樣的實驗結果是一次性且高成本的。由於 Twitch 近年來的高速發展,加上缺乏長期監控的方法,大眾對於 Twitch 的內容傳遞網路的資訊所知相當有限。
在我們實驗室先前的成果中,我們成功使用捉放法中的 CJS 模型來預估Twitch 內容傳遞網路的伺服器數量。然而在這個模型當中,所有伺服器的存活率與被抓取率皆為相同—此假設明顯不符合實際情況。如果假設每個伺服器都有各自的存活率與被抓取率,那麼 CJS 模型將可能會花費許多時間來計算。此外,對於太小的分群,CJS 模型中的最大概似估計將可能有巨大誤差。
因此,在我的研究中,我總共對五個地區的資料進行分群。我試著以在不同時段的 transaction count 作為分群用的屬性,將擁有相似存活率與被抓取率的伺服器分在同一群,藉此實現異質化的 CJS 模型。一開始,我使用 S_Dbw 來分析分群的結果。然而,我發現 S_Dbw 無法幫助預測 CJS 模型預測錯誤率。因此,我提出了 Avg/Std 來分析 CJS 的結果,越大的 Avg/Std 傾向會有越大的錯誤率。
zh_TW
dc.description.abstractStreaming media become more and more popular and important in recent years. Among streaming platforms, Twitch is dominating the game streaming market. In 2021, Twitch had 2,778,000 average concurrent viewers and 105,000 average concurrent stream ers.
Similar to other streaming media, Twitch uses Content Delivery Network to provide the service to massive viewers from all around the world. Content Delivery Network (CDN), which is the key part of the streaming system, is crucial for the quality of service.
In the early work, a one-time experiment has been done to survey Twitch’s CDN. However, due to the rapid growth of Twitch and the high cost of a detailed scan on CDN, Twitch’s CDN remains largely unknown to the public.
In our previous work, we used the CJS model, which assumes every individual shares the same time-dependent survival rate and capture probability, to estimate the CDN size. However, different servers may have different survival rates and capture probability. If we assume every server has its own survival rate and capture probability, the computation overhead of the CJS model may be too high since there are many parameters needed to estimate. Besides, maximum likelihood estimation would have a large bias if the sample size is too small [13].
In this research, I use the transaction count in hour periods to do clustering on the data from 5 countries and use the CMR model with heterogeneity with these clustering results. Next, I use S_Dbw score [7] to evaluate the clustering results. However, I find a better S_Dbw score does not lead to have a lower error rate in the MLE-CJS model. Instead, if Avg/Std in the number of sample servers larger than 0.3 of a cluster, it will tend to have a larger the estimation error rate. As a result, the clustering results with number of clusters less than 5 tend to have a lower estimation error rate since these clustering results contain less clusters with Avg/Std larger than 0.3.
en
dc.description.provenanceSubmitted by admin ntu (admin@lib.ntu.edu.tw) on 2024-01-26T16:17:53Z
No. of bitstreams: 0
en
dc.description.provenanceMade available in DSpace on 2024-01-26T16:17:53Z (GMT). No. of bitstreams: 0en
dc.description.tableofcontentsAcknowledgements iii
摘要 v
Abstract vii
Contents ix
List of Figures xiii
List of Tables xvii

Chapter 1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Research Goal and Challenges . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2 Related Works 5
2.1 Capture-Recapture-Mark Model . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Lincoln-Petersen Model . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Cormack-Jolly-Seber Model . . . . . . . . . . . . . . . . . . . . . 6
2.2 CMR model with Heterogeneity . . . . . . . . . . . . . . . . . . . . 8
2.3 Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Mini-Batch K-Means . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.3 Mean Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Clustering Evaluation - Sdb_w . . . . . . . . . . . . . . . . . . . . . 12
Chapter 3 Pilot Experiment 15
3.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Twitch CDN Discovery . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.3 Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Data from different regions . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Subnet Overlook: 24 Subnet Mask . . . . . . . . . . . . . . . . . . 22
3.3.2 Hour-Count Distribution . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Continuous Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.1 Data in the US . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.2 Data in Other Regions . . . . . . . . . . . . . . . . . . . . . . . . . 28
Chapter 4 Clustering Method 33
4.1 Number of Servers Every Hour in US-1 . . . . . . . . . . . . . . . . 33
4.2 K-Means Clustering of US-1 - Preliminaries . . . . . . . . . . . . . . 37
4.2.1 Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.2 Dimension Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.3 Internal Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Alternative Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Alternative Clustering Method . . . . . . . . . . . . . . . . . . . . . 48
4.5 Alternative Clustering Metrics . . . . . . . . . . . . . . . . . . . . . 49
4.6 Random Clustering - Baseline . . . . . . . . . . . . . . . . . . . . . 50
Chapter 5 CJS Estimation Error 53
5.1 Population Estimation of CJS Model - Preliminaries . . . . . . . . . 53
5.1.1 Estimation Error Rate . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1.2 Dig into Cluster 2 in K-Means . . . . . . . . . . . . . . . . . . . . 55
5.2 CJS with Multiple K-Means Clustering Results . . . . . . . . . . . . 57
5.2.1 Clustering Result with US-1 - May 07 to May 16 . . . . . . . . . . 58
5.2.1.1 S_Dbw Score and Estimation Error Rate . . . . . . . . 58
5.2.1.2 Alternative Clustering Metrics . . . . . . . . . . . . . 60
5.2.2 Clustering Result with US-0 - April 29 to May 05 . . . . . . . . . . 65
5.2.2.1 S_Dbw Score and Estimation Error Rate . . . . . . . . 65
5.2.2.2 Alternative Clustering Metrics . . . . . . . . . . . . . 67
5.3 CJS with Random Clustering Results . . . . . . . . . . . . . . . . . 71
5.3.1 Random Clustering Results in US-0 . . . . . . . . . . . . . . . . . 71
5.3.2 Random Clustering Results in US-1 . . . . . . . . . . . . . . . . . 74
Chapter 6 Sensitivity Analysis 77
6.1 CJS Model in US-0 V.S. US-1 . . . . . . . . . . . . . . . . . . . . . 77
6.1.1 Estimation Error Rate in the US . . . . . . . . . . . . . . . . . . . 78
6.1.2 Cluster Error Rate in the US . . . . . . . . . . . . . . . . . . . . . 79
6.2 CJS Model in the United Kingdom . . . . . . . . . . . . . . . . . . . 80
6.2.1 S_Dbw Score and Estimation Error Rate . . . . . . . . . . . . . . . 80
6.2.2 Why CJS Model Cannot Fit Well in the UK . . . . . . . . . . . . . 83
6.2.3 CJS Model with Random Clustering . . . . . . . . . . . . . . . . . 86
6.3 CJS Model in France . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3.1 S_Dbw Score and Estimation Error Rate . . . . . . . . . . . . . . . 91
6.3.2 Why CJS Model Can Fit in France . . . . . . . . . . . . . . . . . . 94
6.3.3 CJS Model with Random Clustering . . . . . . . . . . . . . . . . . 95
6.4 CJS Model in the Netherlands . . . . . . . . . . . . . . . . . . . . . 98
6.4.1 S_Dbw Score and Estimation Error Rate . . . . . . . . . . . . . . . 98
6.4.2 Why CJS Model Cannot Fit Well in the Netherlands . . . . . . . . . 100
6.4.3 CJS Model with Random Clustering . . . . . . . . . . . . . . . . . 101
6.5 CJS Model in Germany . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.5.1 S_Dbw Score and Estimation Error Rate . . . . . . . . . . . . . . . 106
6.5.2 Why CJS Model Cannot Fit Well in Germany . . . . . . . . . . . . 107
6.5.3 CJS Model with Random Clustering . . . . . . . . . . . . . . . . . 109
Chapter 7 Computation Time of the CJS Model 113
7.1 Computation Time in the US Data . . . . . . . . . . . . . . . . . . . 113
7.2 Computation Time in Different Regions . . . . . . . . . . . . . . . . 114
Chapter 8 Discussions 117
8.1 Online Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Chapter 9 Conclusion and Future Work 119
9.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
References 125
-
dc.language.isoen-
dc.subject分群zh_TW
dc.subjectCormack-Jolly-Seber 模型zh_TW
dc.subject捉放法zh_TW
dc.subject內容傳遞網路zh_TW
dc.subjectTwitchzh_TW
dc.subjectCormack-Jolly-Seber Modelsen
dc.subjectTwitchen
dc.subjectContent Delivery Networken
dc.subjectCapture-Recapture Modelsen
dc.subjectClusteringen
dc.title以異質化捉放法估算內容傳遞網路大小的資料分析zh_TW
dc.titleData analysis for CDN server population estimation based on CMR model with heterogeneityen
dc.typeThesis-
dc.date.schoolyear112-1-
dc.description.degree碩士-
dc.contributor.oralexamcommittee陳伶志;林靖茹zh_TW
dc.contributor.oralexamcommitteeLing-Jyh Chen;Ching-Ju Linen
dc.subject.keywordTwitch,內容傳遞網路,捉放法,分群,Cormack-Jolly-Seber 模型,zh_TW
dc.subject.keywordTwitch,Content Delivery Network,Capture-Recapture Models,Clustering,Cormack-Jolly-Seber Models,en
dc.relation.page127-
dc.identifier.doi10.6342/NTU202400040-
dc.rights.note同意授權(全球公開)-
dc.date.accepted2024-01-15-
dc.contributor.author-college電機資訊學院-
dc.contributor.author-dept電信工程學研究所-
顯示於系所單位:電信工程學研究所

文件中的檔案:
檔案 大小格式 
ntu-112-1.pdf7.37 MBAdobe PDF檢視/開啟
顯示文件簡單紀錄


系統中的文件,除了特別指名其著作權條款之外,均受到著作權保護,並且保留所有的權利。

社群連結
聯絡資訊
10617臺北市大安區羅斯福路四段1號
No.1 Sec.4, Roosevelt Rd., Taipei, Taiwan, R.O.C. 106
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
意見箱
相關連結
館藏目錄
國內圖書館整合查詢 MetaCat
臺大學術典藏 NTU Scholars
臺大圖書館數位典藏館
本站聲明
© NTU Library All Rights Reserved