以最大概略估計為基礎之捉放法模型估算伺服器數量

Wei-Shiang Wung; 翁瑋襄

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85394

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	黃寶儀(Polly Huang)
dc.contributor.author	Wei-Shiang Wung	en
dc.contributor.author	翁瑋襄	zh_TW
dc.date.accessioned	2023-03-19T23:16:06Z	-
dc.date.copyright	2022-07-28
dc.date.issued	2022
dc.date.submitted	2022-07-25
dc.identifier.citation	H. Akaike. Information Theory and an Extension of the Maximum Likelihood Principle, pages 199–213. Springer New York, New York, NY, 1998. D. Borchers and M. Efford. Spatially explicit maximum likelihood methods for capture-recapture studies. Biometrics, 64:377–85, 07 2008. T. Böttger, F. Cuadrado, G. Tyson, I. Castro, and S. Uhlig. Open connect everywhere: A glimpse at the internet ecosystem through the lens of the netflix cdn. SIGCOMM Comput. Commun. Rev., 48(1):28–34, Apr. 2018. L. Breiman. Random forests. In Machine Learning, pages 5–32, 2001. C. Brownie and D. S. Robson. Models allowing for age-dependent survival rates for band-return data. Biometrics, 32(2):305–323, 1976. L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013. K. P. Burnham. Design and analysis methods for fish survival experiments based on release-recapture / Kenneth P. Burnham ... [et al.]. American Fisheries Society Bethesda, Md, 1987. Chase. State of the stream april/2020: Valorant and valorant streamers top the charts, May 2020. R. M. CORMACK. Estimates of survival from the sighting of marked animals. Biometrika, 51(3-4):429–438, 12 1964. A. W. F. Edwards. Likelihood. Cambridge University Press, 1972. M. G. Efford and M. R. Schofield. A spatial open-population capture-recapture model. Biometrics, 76(2):392–402, 2020. A. et al. A tale of three cdns: An active measurement study of hulu and its cdns. In 2012 Proceedings IEEE INFOCOM Workshops, pages 7–12, 2012. D. et al. Internet scale user-generated live video streaming: The twitch case. In Proceedings of PAM, pages 60–71, New York, NY, 02 2017. Springer. W. et al. Anatomy of a personalized livestreaming system. In Proceedings of the 2016 Internet Measurement Conference, IMC ’16, page 485–498, New York, NY, USA, 2016. Association for Computing Machinery. C. (evoli) Conners and R. S. Breslau. Wall street journal chart lists twitch.tv fourth in u.s. peak traffic, Feb 2014. D. Giordano, S. Traverso, L. Grimaudo, M. Mellia, E. Baralis, A. Tongaonkar, and S. Saha. Youlighter: A cognitive approach to unveil youtube cdn and changes. IEEE Transactions on Cognitive Communications and Networking, 1(2):161–174, 2015. J. A. Hartigan and M. A. Wong. Algorithm as 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9:1735–80, 12 1997. C. Hsu. Data analysis for cdn server population estimation based on crm model with heterogeneity, 2022. Y. Jin, S. Renganathan, G. Ananthanarayanan, J. Jiang, V. N. Padmanabhan, M. Schroder, M. Calder, and A. Krishnamurthy. Zooming in on wide-area latencies to a global cloud provider. In Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM ’19, page 104–116, New York, NY, USA, 2019. Association for Computing Machinery. G. M. Jolly. Explicit estimates from capture-recapture data with both death and immigration-stochastic model. Biometrika, 52(1/2):225–247, 1965. C. Krebs. Ecological Methodology, 3rd ed. 2014. https://www.zoology.ubc.ca/~krebs/books.html. J. Laake, D. Johnson, and P. Conn. marked: An r package for maximum likelihood and markov chain monte carlo analysis of capture-recapture data. Methods in Ecology and Evolution, 4:885–890, 09 2013. F. Lincoln. Calculating Waterfowl Abundance on the Basis of Banding Returns. Circular (United States. Department of Agriculture). U.S. Department of Agriculture, 1930. A. Mickunas. How does league＇s worlds viewership compare to the super bowl?, Feb 2018. A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the theory of statistics. McGraw-Hill New York, 3rd ed. edition, 1974. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. C. G. J. PETERSEN. The yearly immigration of young plaice into the limfjord from the german sea, ect. Report of the Danish Biological Station for 1895, 6:1–48, 1896. S. Pledger, K. H. Pollock, and J. L. Norris. Open capture-recapture models with heterogeneity: I. cormack-jolly-seber model. Biometrics, 59(4):786–794, 2003. S. Pledger, K. H. Pollock, and J. L. Norris. Open capture–recapture models with heterogeneity: Ii. jolly–seber model. Biometrics, 66(3):883–890, 2010. K. H. POLLOCK. A K-sample tag-recapture model allowing for unequal survival and catchability. Biometrika, 62(3):577–583, 12 1975. B. L. S. Prakasa Rao. Maximum likelihood estimation for markov processes. Annals of the Institute of Statistical Mathematics, 24(1):333–345, Dec 1972. L. Quan, J. Heidemann, and Y. Pradkin. Trinocular: Understanding internet reliability through adaptive probing. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, SIGCOMM ’13, page 255–266, New York, NY, USA, 2013. Association for Computing Machinery. D. S. R. R. M. Cormack, G. P. Patil. Sampling Biological Populations. Fairland, Maryland: International Cooperative Publishing House., 1979. C. J. Schwarz and A. N. Arnason. A general methodology for the analysis of capture-recapture experiments in open populations. Biometrics, 52(3):860–873, 1996. G. A. F. Seber. A note on the multiple-recapture census. Biometrika, 52(1-2):249–260, 06 1965. TwitchTracker. Twitch statistics and charts, Aug 2021. C. Wang. Discovering twitch’s video delivery infrastructure utilizing cloud services and vpns, 2021. W.-S. Wung, G.-T. Ting, R.-T. Hsu, C. Hsu, Y.-C. Tsai, C. Wang, Y.-T. Liu, H. Chen, and P. Huang. Twitch＇s cdn as an open population ecosystem. In Asian Internet Engineering Conference, AINTEC ’21, page 56–63, New York, NY, USA, 2021. Association for Computing Machinery.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/85394	-
dc.description.abstract	影音服務的品質與順暢度取決於內容傳遞網路 (CDN) 的規模與設備完善程度。近年來因為影音服務需求的成長，為了因應客戶的需求，內容傳遞網路中的伺服器數量也顯著的增加。因為 Twitch 廣泛的影音應用，我們認為長時間持續續的探究其內容傳遞網路架構是一個重要課題。發想於內容傳遞網路中的伺服器數量的新增淘汰和動物群體出生死亡行為的相似性，我們認為可以套用野生群體數量估算的捉放法，透過每次少量的網路流量取樣以達到估算整體內容傳遞網路的伺服器數量。在我們過去發表的 AINTEC 論文(2021)中， Cormack-Jolly-Seber (CJS) 模型能夠相對準確的估算出伺服器總數。然而，傳統的 CJS 模型的機率假設仍有很多的限制。因為其需要較多採樣間距估算才能收斂，導致這個模型僅限於離線的估算。此外，這個模型假設所有個體被抓捕和生存的機率都是一樣的，這個假設並不符合 Twitch 的內容傳遞網路中伺服器的服務型態。因此，我們引入了考慮異質性、以最大概似估計為基礎的 CJS 捉放法模型來解決這兩個議題。這個模型不僅可以賦予每一台伺服器不同的參數設定，還會以最大概略估計一次性估算 CJS 機率模型中的所有參數。不過每一伺服器都有相對應的參數會導致整個機率模型過於複雜，我們因此使用分群法按照提供服務的模式將伺服器分群，透過讓同一群伺服器共用參數以達到減少模型的參數數量。我們使用 2021 年五月蒐集的資料集做測試，發現以最大概似估計為基礎的 CJS 模型的確在在線估算中有較好的表現，而異質性和伺服器分群在實驗中並未有效提升估算準確率，我們透過檢視各分群的估算結果詳細分析其中原因。	zh_TW
dc.description.abstract	The quality and continuity of the video services such as Twitch depend on the scale and well-being of their content distribution networks (CDNs). Due to the growing demand for video services, server numbers in the CDNs have rapidly increased to feed videos to the clients. Given the widespread use of Twitch, we find continuous survey of its CDN an important subject of study. Inspired by Capture-Mark-Recapture(CMR), a methodology widely used to estimate animal population, we developed a system to continuously observe its CDN size (i.e., the number of servers) with lightweight probing. According to our previous research in AINTEC, the Cormack-Jolly-Seber (CJS) model can estimate the CDN size at each sample time with relatively low errors. Nevertheless, the assumptions of the traditional CJS model are still restrictive. Due to its long converging period, the model can only estimate server population offline. Besides, it assumes that all servers share the same capturing and survival rates, which does not meet the server patterns in Twitch's CDN. Therefore, we introduce the Maximum-Likelihood-Estimation-based (MLE) CJS model with heterogeneity to address these two issues. It not only allows different parameters for each server but also co-estimates all parameters in the CJS probability model. The resulting MLE model is too complicated, and thus we try server clustering to reduce the parameter space. Using a data set collected in May 2021, we find the MLE-based CJS indeed performs better in online estimation. Heterogeneity and server clustering, on the other hand, do not improve the estimation accuracy. For these worse results, we identify the detailed reasons with the estimation results in each group.	en
dc.description.provenance	Made available in DSpace on 2023-03-19T23:16:06Z (GMT). No. of bitstreams: 1 U0001-0507202223022200.pdf: 2876582 bytes, checksum: f7879bc817424f97ba396a62b58c07bb (MD5) Previous issue date: 2022	en
dc.description.tableofcontents	Acknowledgements i 中文摘要ii Abstract iv Contents vi List of Figures ix List of Tables xi Chapter 1 Introduction 1 Chapter 2 Related Work 6 2.1 Lincoln-Peterson Model . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Cormack-Jolly-Seber Model . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Estimation on Server Population with CJS Model . . . . . . . . . . . 9 2.4 MLE-based CJS Model . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Chapter 3 Data Set 15 3.1 Introduction to Twitch . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.1 Hourly Server Count . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.2 Daily Server Count . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Chapter 4 Preliminary Experiment 20 4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4.1.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.1.2 Capturing and Survival Rate Estimation with Marked . . . . . . . . 21 4.1.3 Server Population Estimation with Capturing Probability . . . . . . 24 4.2 Offline Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.3 Online Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.4 Delayed Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.5 Convergence Periods of the Two CJS Models . . . . . . . . . . . . . 31 Chapter 5 Clustering-based Experiment 32 5.1 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.1.1 Server Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.1.1.1 Frequency of Occurrence . . . . . . . . . . . . . . . . 33 5.1.1.2 Server IP Prefix . . . . . . . . . . . . . . . . . . . . . 34 5.1.1.3 K-means Clustering with Transaction Numbers . . . . 35 5.1.2 Server Population Estimation . . . . . . . . . . . . . . . . . . . . . 37 5.2 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2.1 Frequency of Occurrence . . . . . . . . . . . . . . . . . . . . . . . 38 5.2.2 Server IP Prefix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.2.2.1 16-bit IP Prefix . . . . . . . . . . . . . . . . . . . . . 41 5.2.2.2 24-bit IP Prefix . . . . . . . . . . . . . . . . . . . . . 43 5.2.3 K-means Clustering with Transaction Numbers . . . . . . . . . . . 45 5.2.3.1 Clustering by Transaction Numbers in Different Time Periods . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.2.3.2 Clustering by Transaction Numbers in Different Days . 47 5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.3.1 High Estimation Error Rates in Some Groups . . . . . . . . . . . . 50 5.3.2 No Captures in Some Samples . . . . . . . . . . . . . . . . . . . . 51 5.3.3 Computational Overhead of MLE-based CJS Model . . . . . . . . . 53 Chapter 6 Conclusion and Future Work 56 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 References 61
dc.language.iso	en
dc.subject	Twitch	zh_TW
dc.subject	最大概略估計	zh_TW
dc.subject	異質性	zh_TW
dc.subject	伺服器數量估計	zh_TW
dc.subject	捉放法	zh_TW
dc.subject	Heterogeneity	en
dc.subject	Twitch	en
dc.subject	Capture-Mark-Recapture	en
dc.subject	Server Population Estimation	en
dc.subject	Maximum Likelihood Estimation	en
dc.title	以最大概略估計為基礎之捉放法模型估算伺服器數量	zh_TW
dc.title	Estimation on Server Population with MLE-based CMR (Capture-Mark-Recapture)	en
dc.type	Thesis
dc.date.schoolyear	110-2
dc.description.degree	碩士
dc.contributor.oralexamcommittee	林靖茹(Kate Ching-Ju Lin),陳伶志(Ling-Jyh Chen)
dc.subject.keyword	Twitch,捉放法,伺服器數量估計,最大概略估計,異質性,	zh_TW
dc.subject.keyword	Twitch,Capture-Mark-Recapture,Server Population Estimation,Maximum Likelihood Estimation,Heterogeneity,	en
dc.relation.page	65
dc.identifier.doi	10.6342/NTU202201298
dc.rights.note	同意授權(全球公開)
dc.date.accepted	2022-07-25
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	電機工程學研究所	zh_TW
dc.date.embargo-lift	2022-07-28	-
顯示於系所單位：	電機工程學系

文件中的檔案：

檔案	大小	格式
U0001-0507202223022200.pdf	2.81 MB	Adobe PDF	檢視/開啟

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。