實用性導向頻繁集勘測

Kun-Ta Chuang; 莊坤達

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/33216

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	陳銘憲(Ming-Syan Chen)
dc.contributor.author	Kun-Ta Chuang	en
dc.contributor.author	莊坤達	zh_TW
dc.date.accessioned	2021-06-13T04:29:41Z	-
dc.date.available	2006-07-21
dc.date.copyright	2006-07-21
dc.date.issued	2006
dc.date.submitted	2006-07-20
dc.identifier.citation	[1] F. Afrati, A. Gionis, and H. Mannila. Approximating a Collection of Frequent Sets. In Proc. of ACM SIGKDD, 2004. [2] R. Agrawal, T. Imielinski, andA. Swami.Mining association rules between sets of items in large databases. In Proc. of SIGMOD, 1993. [3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of VLDB, 1994. [4] A. V. Aho, R. Sethi, and J. D. Ullman. Compliers. Principles, Techniques and Tools. Addison- Wesley, 1986. [5] A. Arasu and G. S. Manku. Approximate counts and quantiles over sliding windows. In Proc. of PODS, 2004. [6] N. Ayan, A. Tansel, and E. Arkun. An Efficient Algorithm to Update Large Itemsets with Early Pruning. In Proc. of ACM SIGKDD, 1999. [7] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of PODS, 2002. [8] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In Proc. of 13th Annual ACM-SIAM Symposium on Discrete Algorithms, 2002. [9] B. Babcock, M. Datar, R. Motwani, and L. O’Callaghan. Maintaining variance and k-medians over data stream windows. In Proc. of PODS, 2003. [10] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. [11] J. Beran. Statistics for Long-Memory Processes. Monographs on Statistics and Applied Probability, Chapman and Hall, 1994. [12] C. Bettini, X. S.Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. IEEE Data Eng. Bull, 1998. [13] C. Bettini, X. S. Wang, S. Jajodia, and J.-L. Lin. Discovering frequent event patterns with multiple granularities in time sequences. IEEE Trans. on Knowledge and Data Engineering, 1998. [14] Z. Bi, C. Faloutsos, and F. Korn. The 'DGX' Distribution for Mining Massive, Skewed Data. In Proc. of ACM SIGKDD, 2000. [15] C. Blake and C. Merz. UCI repository of machine learning databases, 1998. [16] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and zipf-like distributions: Evidence and implications. In Proc. of IEEE INFOCOM, 1999. [17] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of SIGMOD, 1997. [18] H. Bronnimann, B. Chen, M. Dash, P. J. Haas, and P. Scheuermann. Efficient Data Reduction with EASE. In Proc. of SIGKDD, 2003. [19] A. G. Buchner and M. D. Mulvenna. Discovering internet marketing intelligence through online analytical web usage mining. SIGMOD Record, 27(4):54–61, 1998. [20] B. Chen, P. Haas, and P. Scheuermann. A new two-phase sampling based algorithm for discovering association rules. In Proc. of SIGKDD, 2002. [21] H.-L. Chen, K.-T. Chuang, and M.-S. Chen. Toward Highly Efficient Clustering: A Framework of Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values. In Proc. of the IEEE 5th Intern’l Conf. on Data Mining (ICDM), 2005. [22] M.-S. Chen, J.-S. Park, and P. S. Yu. Efficient Data Mining for Path Traversal Patterns. IEEE Transactions on Knowledge and Data Engineering, 1998. [23] X. Chen and I. Petr. Discovering temporal association rules: Algorithms, language and system. In Proc. of IEEE ICDE, 2000. [24] D. Cheung, J. Han, V. Ng, and C.Wong. Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. In Proc. of IEEE ICDE, 1996. [25] Y. Cheung and A. Fu. Mining Association Rules without Support Threshold: with and without Item Constraints. In TKDE, 2004. [26] Y. Chi, H.Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In Proc. of ICDM, 2004. [27] K.-T. Chuang and M.-S. Chen. Clustering Categorical Data by Utilizing the Correlated-Force Ensemble. In Proc. of the 4th SIAM Intern’l Conference on Data Mining (SDM), 2004. [28] K.-T. Chuang and M.-S. Chen. Frequent Pattern Discovery with Memory Constraint. In Proc. of the ACM 14th Intern’l Conf. on Information and Knowledge Management (CIKM), 2005. [29] K.-T. Chuang, M.-S. Chen, and W.-C. Yang. Progressive sampling for association rules based on sampling error estimation. In Proc. of PAKDD, 2005. [30] K.-T. Chuang, J.-L. Huang, and M.-S. Chen. On Exploring the Power-Law Relationship in the Itemset Support Distribution. In Proc. of the 10th Intern’l Conf. on Extending Database Technology (EDBT), 2006. [31] W. G. Cochran. Sampling Techniques. John Wiley and Sons, 1977. [32] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. In Proc. of ICDE, 2000. [33] S. Cong, J. Han, J. H., and D. A. Padua. A sampling-based framework for parallel data mining. In Proc. of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005. [34] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In Proc. of SIAM SDM, 2005. [35] M. E. Crovella and A. Bestavros. Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes. In Proc. of ACM SIGMETRICS, 1996. [36] G. Das, K.-I. Lin, H. Mannila, G. Renganathan, and P. Smyth. Rule discovery from time series. In Knowledge Discovery and Data Mining, pages 16–22, 1998. [37] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. of SODA, 2002. [38] S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, and A. Tomkins. Self-similarity in the web. In Proc. of VLDB, 2001. [39] C. Domingo, R. Gavalda, and O.Watanabe. Adaptive SamplingMethods for Scaling Up Knowledge Discovery Algorithms. Data Mining and Knowledge Discovery, 2002. [40] G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proc. of SIGKDD, 1999. [41] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Proc. of ICML, 1995. [42] L. Egghe. The distribution of n-grams. Scientometrics, 2000. [43] C. Faloutsos. Next Generation Data Mining Tools: Power Laws and Self-similarity for Graphs, Streams and Traditional Data. ECML, 2003. [44] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In Proc. of ACM SIGCOMM, 1999. [45] V. Ganti, M.-L. Lee, and R. Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. In Proc. of VLDB, 2000. [46] M. Garofalakis and P. B. Gibbons. Wavelet synopses with error guarantees. In Proc. of SIGMOD, 2002. [47] F. Geerts, B. Goethals, and J. V. D. Bussche. Tight upper bounds on the number of candidate patterns. ACM Transactions on Database Systems, 2005. [48] F. Geerts, B. Goethals, and J. V. d. Bussche. A tight upper bound on the number of candidate patterns. In Proc. of IEEE ICDM, 2001. [49] S. Ghahramani. Fundamentals of Probability, 2nd Edition. Prentice Hall, 1999. [50] A. Ghoting, G. Buehrer, S. Parthasarathy, Y. D. Kim, A. Nguyen, and P. Dubey. Cache-conscious frequent pattern mining on a modern processor. In Proc. of VLDB, 2005. [51] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. of VLDB, 2001. [52] P. B. Gibbons and Y.Matias. New sampling-based summary statistics for improving approximate query answers. In Proc. of SIGMOD, 1998. [53] P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. In Proc. of VLDB, 1997. [54] B. Goethals. Memory issues in frequent itemset mining. In Proc. of SAC, 2004. [55] B. Goethals andM. J. Zaki. Advances in Frequent ItemsetMining Implementations: Introduction to FIMI03. In Proc. of Workshop on Frequent Itemset Mining Implementations, 2003. [56] S. Guha, K. Shim, and J. Woo. REHIST: Relative Error Histogram Construction Algorithms. In Proc. of VLDB, 2004. [57] P. J. Haas and A. N. Swami. Sequential sampling procedures for query size estimation. In Proc. of SIGMOD, 1992. [58] J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. of ICDE, 1999. [59] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. of 1995 Int’l Conf. on Very Large Data Bases, 1995. [60] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. [61] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD, 2000. [62] J. Han, J. Pei, and Y. Yin. Mining Frequent PatternsWithout Candidate Generation. Proceedings of ACM SIGMOD International Conference on Management of Data, 2000. [63] J. Han, J. Pei, Y. Yin, and R. Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. In DMKD, 2004. [64] J. Hipp, U. Guntzer, and G. Nakhaeizadeh. Algorithms for association rule mining - a general survey and comparison. In SIGKDD Explorations, 2000. [65] W.-C. Hou, G. Ozsoyoglu, and E. Dogdu. Error-constrained count query evaluation in relational databases. In Proc. of SIGMOD, 1991. [66] C.-N. Hsu and C. Knoblock. Discovering robust knowledge from databases that change. Data Mining and Knowledge Discovery, pages 69–95, 1998. [67] Y. Ioannidis. The history of histograms. In Proc. of VLDB, 2003. [68] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling algorithms in a stream operator. In Proc. of SIGMOD, 2005. [69] R. Koch. The 80/20 Principle: The Secret of Achieving More With Less. Nicholas Brealey Publishing, 1998. [70] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. Efficient biased sampling for approximate clustering and outlier detection in large datasets. In IEEE Trans. on Knowledge and Data Engineering, 2003, 2003. [71] C.-H. Lee, C.-R. Lin, and M.-S. Chen. Sliding-Window Filtering: An Efficient Algorithm for Incremental Mining. Proc. of CIKM, 2001. [72] S. D. Lee, D. W.-L. Cheung, and B. Kao. Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules. DMKD, 2(3):233–262, 1998. [73] P. S. Levy and S. Lemeshow. Sampling of Populations: Methods and Applications. John Wiley and Sons, 1991. [74] J.-L. Lin andM. H. Dunham. Mining association rules: Anti-skew algorithms. In Proc. of ICDE, 1998. [75] R. J. Lipton and J. F. Naughton. Query size estimation by adaptive sampling. Journal of Computer and System Science, 1995. [76] B. Liu, W. Hsu, and Y. Ma. Mining Association Rules with Multiple Minimum Supports. In Proc. of ACM SIGKDD, 1999. [77] H. Lu, L. Feng, and J. Han. Beyond Intra-Transaction Association Analysis: Mining Multi- Dimensional Inter-Transaction Association Rules. ACM Transactions on Information Systems, 2000. [78] G. S. Manku and R. Motwani. Approximate frequency counts over streaming data. In Proc. of VLDB, 2002. [79] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. In Data Mining and Knowledge Discovery, 1997. [80] Y. Matias, J. S. Vitter, and M. Wang. Dynamic maintenance of wavelet-based histograms. In Proc. of VLDB, 2000. [81] A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proc. of ICDT, 2005. [82] A. Nanopoulos, Y. Theodoridis, and Y. Manolopoulos. An efficient and effective algorithm for density biased sampling. In Proc. of the ACM CIKM Conf., 2002. [83] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. Proceedings of the 20th VLDB Conference, 1994. [84] S. Orlando, C. Lucchese, P. Palmerini, R. Perego, and F. Silvestri. kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets. In Proc. of Workshop on Frequent Itemset Mining Implementations, 2004. [85] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and resource-aware mining of frequent sets. In Proc. of IEEE ICDM, 2002. [86] B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. of ICDE, 1998. [87] C. R. Palmer and C. Faloutsos. Density Biased Sampling: An ImprovedMethod for DataMining and Clustering. Proc. of SIGMOD, 2000. [88] J.-S. Park, M.-S. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In Proc. of ACM SIGMOD, 1995. [89] S. Parthasarathy. Efficient progressive sampling for association rules. In Proc. of IEEE ICDM, 2002. [90] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. of ICDT, 1999. [91] J. Pei and J. Han. Can We Push More Constraints into Frequent Pattern Mining? In Proc. of ACM SIGKDD, 2000. [92] J. Pei, J. Han, B. M.-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan mining sequential patterns efficiently by prefix projected pattern growth. In In Proc. of IEEE ICDE, 2001. [93] F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proc. of ACM SIGKDD, 1999. [94] G. Ramesh, W. A. Maniatty, and M. J. Zaki. Feasible itemset distributions in data mining: Theory and application. In Proc. of ACM PODS, 2003. [95] J. A. Rice. Mathematical statistics and data analysis. Duxbury Press, 1995. [96] J. F. Roddick and M. Spiliopoulou. A survey of temporal knowledge discovery paradigms and methods. In IEEE Trans. on Knowledge and Data Engineering, 2002. [97] R. L. Scheaffer, W. Mendenhall, and R. L. Ott. Elementary Survey Sampling. Duxbury Press, 1995. [98] M. Spiliopoulou and J. F. Roddick. Higher order mining: Modelling and mining the results of knowledge discovery. In Proc. of Intel. Conf. on Data Mining Methods and Databases, 2000. [99] R. Srikant and R. Agrawal. Mining Generalized Association Rules. IBM Research Report, 1995. [100] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proc. of ACM SIGMOD, 1996. [101] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of EDBT, 1996. [102] S. Y. Sung, Z. Li, C. L. Tan, and P. A. Ng. Forecasting association rules using existing data sets. In IEEE Trans. on Knowledge and Data Engineering, 2003. [103] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In Proc. of ACM SIGKDD, 2002. [104] W.-G. Teng, M.-S. Chen, and P. S. Yu. A regression-based temporal pattern mining scheme for data streams. In Proc. of VLDB, 2003. [105] S. K. Thompson and G. A. F. Seber. Adaptive Sampling. WILEY Series in Probability and Statistics, 1996. [106] H. Toivonen. Sampling large databases for association rules. In Proc. of VLDB, 1996. [107] A. K. H. Tung, H. Lu, J. Han, and L. Feng. Efficient Mining of Intertransaction Association Rules. In IEEE Trans. on Knowledge and Data Engineering, 2003. [108] J. Vitter. Random sampling with a reservoir. In ACM Trans. on Mathematical Software, 1985. [109] J. S. Vitter. An efficient algorithm for sequential random sampling. In ACM Trans. on Mathematical Software, Vol. 13, No. 1, 1987. [110] A. Wald. Sequential Analysis. Wiley, 1947. [111] J.Wang, J. Han, Y. Lu, and P. Tzvetkov. TFP: An Efficient Algorithm forMining Top-K Frequent Closed Itemsets. In TKDE, 2005. [112] K. Wang, Y. He, and J. Han. Mining frequent itemsets using support constraints. In Proc. of VLDB, 2000. [113] K. Wang, S. Zhou, and S. Liew. Building Hierarchical Classifiers Using Class Proximity. In Proc. of VLDB, 1999. [114] W. Wang, J. Yang, and R. R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. In Proc. of ICDE, 2001. [115] W.Willinger, M. S. Taqqu,W. E. Leland, and D. V.Wilson. Self-similarity in high-speed packet traffic: analysis and modelling of ethernet traffic measurements. Statistical Science, 10(1), 1995. [116] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco, 1999. [117] R. C.-W. Wong and A. Fu. Mining top-k itemsets over a sliding window based on zipfian distribution. In Proc. of SIAM SDM, 2005. [118] Y. Xiao andM. H. Dunham. Considering main memory in mining association rules. In Proc. of DAWAK, 1999. [119] C. Yang, U. Fayyad, and P. Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In Proc. of ACM SIGKDD, 2001. [120] J. X. Yu, Z. Chong, H. Lu, and A. Zhou. False positive or false negative: Mining frequent itemsets from high speed transactional data streams. In Proc. of VLDB, 2004. [121] P. S. Yu, M.-S. Chen, H. Heiss, and S. Lee. On workload characterization of relational database environments. IEEE Trans. on Software Engineering, 1992. [122] C.-H. Yun and K.-T. C.M.-S. Chen. Using Category-Based Adherence to ClusterMarket-Basket Data. In Proc. of the IEEE 2nd Intern’l Conf. on Data Mining (ICDM), 2002. [123] M. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of sampling for data mining of association rules. In Int. Workshop on Research Issues in Data Engineering, 1997. [124] M. J. Zaki and C.-J. Hsiao. Charm: An efficient algorithm for closed itemset mining. In Proc. of SIAM SDM, 2002. [125] M. J. Zaki, S. Parthasarathy,M. O., andW. Li. New algorithms for fast discovery of association rules. In Proc. of ACM SIGKDD, 1997. [126] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proc. of SIGKDD, 2001. [127] G. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, 1949.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/33216	-
dc.description.abstract	自從早期頻繁集探勘演算法Apriori 被提出之後，相關頻繁集探勘技術已被廣泛地提出及討論。然而，即使這些技術可以滿足某些現實生活中的應用要求，我們發現，如何提供可實用之頻繁集探勘技術，達到容易使用，低複雜度，高效率等實用課題，依然是一個極富挑戰性問題。有鑑於此，我們在本論文中提出一可直接尋找出top-k (closed) 頻繁集之探勘技術，同時考慮到其所需之執行記憶體上限。不同於以往只考慮如何快速得到正確的頻繁集之技術，我們探討如何限定所需記憶體的上限。同時在這樣的要求下，所提出之MTK 及MTK_Close 演算法可以不用設定使用者難以理解的最小支持度參數。相反的，使用者只需設定其容易理解的參數，也就是所需之頻繁集個數。實驗上證明，MTK 跟MTK_Close 可以同時達成高執行效率及考慮到記憶體上限。此外，本論文亦提出一在滑動視窗模型下，可依序產生高品質樣本之資料取樣技術，稱之為FPS 演算法。在此我們考慮之高品質樣本，指的是每一屬性值之母體比率及樣本比率在一滑動視窗下之相似程度。FPS 演算法有幾個好處：(1)它可以連續地從一個隨時間變化的資料源中產生樣本；(2)FPS 的執行時間相對資料的大小是線性成長的；(3)對大多數屬性值來說，其相對比例誤差可被保證在一定的錯誤上限之下，而剩餘的屬性值的相對比例誤差亦可被極盡的縮小，可保證所產生之樣本是高品質之樣本；(4)產生樣本之取樣比率可以接近使用者所指定的比率，無需增加樣本大小來達成高品質樣本的需求；(5) FPS 亦能優異地保持多維度統計量的比率；(6) FPS 不為特定應用而設計，因此可以使用於不同之探勘應用。在本論文中，我們也觀察了一個實際資料中存在的重要特徵，稱之為itemset supportdistribution，用以提供對實際資料特性更好的模型分析。重要地，從觀察不同的實際資料後，我們確認，Power-Law 的現象的確出現於itemset support distribution 中，並且我們可以用Zipf distribution 來描繪此現象。然後由於要尋找出這個特徵是十分費時的，我們更進一步地，提出一有效率之演算法，用於快速正確的抓取出itemset support distribution 的數學模型特徵。此外，我們也充分利用這個實際資料中發現的特性，提出新穎的機制解決二個重要的探勘問題：(1)自動決定一個在資料串流環境中頻繁集探勘演算法所需之參數；(2)決定頻繁集探勘演算法所需之最低樣本大小。在本論文中，我們也試圖回答一個重要的探勘技術問題：『那些子集將在未來是頻繁集？』這樣的探勘樣式，我們稱之為預期中的頻繁集(Prospective Frequent Patterns)，對使用者是十分有意義的探勘結果。因為許多交互推銷的策略在真實應用中的成功與否，是依靠在頻繁集的出現能否被精確地預言。由於以往的技術不能用來有效地得到預期中的頻繁集，我們在本論文中亦提出一架構，可用來精確地預測預期中的頻繁集，同時也預測他們可能的支持度。	zh_TW
dc.description.abstract	Since the early work in algorithm Apriori, a broad spectrum of topics in mining frequent patterns has been studied. While those proposed techniques are important results toward the integration of mining association mining and other real-life requirements, how to provide feasibility-oriented models for mining frequent patterns, to enable easy-use, low-cost, high-efficiency, and realistic mining applications, still remains as a challenging issue. In view of this, we explore in this dissertation a novel algorithm of mining top-k (closed) itemsets in the presence of the memory constraint. As opposed to most previous works that concentrate on improving the mining efficiency or on reducing the memory size by best effort, we first attempt to specify the available upper memory size that can be utilized by mining frequent itemsets. While complying with the requirement of the memory constraint, two efficient algorithms, called MTK and MTK_Close, were thus devised for mining frequent itemsets and closed itemsets, respectively, without specifying the subtle minimum support. Instead, users only need to give a more human-understandable parameter, namely the desired number of frequent (closed) itemsets k. Furthermore, a sampling model, called feature preserved sampling (FPS) that sequentially generates a high-quality sample over sliding windows, is developed. The sampling quality we consider refers to the degree of consistency between the sample proportion and the population proportion of each attribute value in a window. FPS has several advantages: (1) it sequentially generates a sample from a time-variant data source over sliding windows; (2) the execution time of FPS is linear with respect to the database size; (3) the relative proportional differences between the sample proportions and population proportions of most distinct attribute values are guaranteed to be below a specified error threshold, ε, while the relative proportion differences of the remaining attribute values are as close to ε as possible, which ensures that the generated sample is of high-quality; (4) the sample rate is close to the user specified rate so that a high-quality sampling result can be obtained without increasing the sample size; (5) FPS can excellently preserve the population proportion of multivariate statistics in the sample; and (6) FPS can be applied to infinite streams and finite datasets equally, and the generated samples can be used for various applications. We next investigate an important characteristic in real datasets, named the itemset support distribution, to provide better understanding on real datasets. The itemset support distribution refers to the distribution of the count of itemsets versus the itemset support. Importantly, from observations on various retail datasets and as validated by our empirical studies later, we find that the power-law relationship indeed appears in the itemset support distribution and we can characterize that as a Zipf distribution. Since it is prohibitively expensive to retrieve lots of itemsets before we identify the characteristics of the itemset support distribution in targeted data, we also propose a valid and cost-effective algorithm, called algorithm PPL, to extract characteristics of the itemset support distribution. Furthermore, to fully explore the advantages of our discovery, we also propose novel mechanisms with the help of PPL to solve two important problems: (1) determining a subtle parameter for mining approximate frequent itemsets over data streams; and (2) determining the sufficient sample size for mining frequent patterns. In this dissertation, we also attempt to answer an important question: 'What patterns will be frequent in the future?' Such a kind of patterns, referred to as prospective frequent patterns, is very informative to end-users, because many cross-selling strategies in real cases rely on the precise prediction of frequent patterns that will appear. Since any naive extension of previous works cannot effectively obtain the desired result, we proposed the framework of PFP, to precisely predict prospective frequent patterns while also predicting their supports.	en
dc.description.provenance	Made available in DSpace on 2021-06-13T04:29:41Z (GMT). No. of bitstreams: 1 ntu-95-F89921134-1.pdf: 3194066 bytes, checksum: 78e1301de064c5200782c7fb3e5e9df0 (MD5) Previous issue date: 2006	en
dc.description.tableofcontents	1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Overview of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Mining Top-k Frequent Patterns with the Memory Constraint . . . . . . . . . . 3 1.2.2 On-line Generation of High-Quality Samples for Data Mining Applications . . 4 1.2.3 The Discovery of the Power-Law Relationship in the Itemset Support Distribution and Its Importance to Application Designs . . . . . . . . . . . . . . . . 5 1.2.4 Mining Prospective Frequent Patterns . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Mining Top-k Frequent Patterns in the Presence of the Memory Constraint 8 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Memory-Constraint Frequent-Pattern Mining . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.2 Algorithm Naive: The Baseline Method to Discover Frequent Patterns in the Presence of the Memory Constraint . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Memory-Constraint Top-k Frequent-Pattern Mining . . . . . . . . . . . . . . . . . . . 20 2.3.1 Principles to Search Top-k Frequent Patterns . . . . . . . . . . . . . . . . . . 20 2.3.2 The δ-stair Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 AlgorithmMTK. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.1 Implementation of MTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.2 Extension of MTK to Mine Top-k Closed Itemsets . . . . . . . . . . . . . . . 31 2.4.3 Illustrative Running Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.5 Experimental Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5.1 Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5.2 Experiments on Mining Top-k Frequent Itemsets . . . . . . . . . . . . . . . . 38 2.5.3 Experimentson Mining Top-k Closed Itemsets . . . . . . . . . . . . . . . . . 43 2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3 Feature Preserved Sampling for Incremental Data Mining 47 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.2.1 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.2.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3 Analysis of Feature Preserved Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.3.1 Fundamental Mathematical Analysis of Feature Preserved Sampling . . . . . . 58 3.3.2 Advanced Discussionof Feature Preserved Sampling . . . . . . . . . . . . . . 65 3.4 Algorithm FPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.4.1 Implementation of FPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.4.2 Illustrative Examples of FPS . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.4.3 Complexity Analysis of FPS . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3.5.1 Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3.5.2 Sensitivity Analysis of FPS . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.5.3 Application Study: Association Rules and Clustering . . . . . . . . . . . . . . 84 3.5.4 Application Study: Temporal Pattern Discoveryover Data Streams . . . . . . . 86 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4 Power-Law Relationship and Self-Similarity in the Itemset Support Distribution 89 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.1.1 Motivating Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.1.2 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.2 Identify the Power-Law Relationship and Self-Similarity in the Itemset Support Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.2.1 Review of the Power-Law Relationship and Self-Similarity . . . . . . . . . . . 94 4.2.2 Observations on the Power-Law Relationship in Itemset Support Distribution . 96 4.2.3 Observations on Self-Similarity in Itemset Support Distribution . . . . . . . . 98 4.3 Design of Algorithm PPL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.3.1 Phase I:Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.3.2 Phase II: Discover High-Support Itemsets in the Sample . . . . . . . . . . . . 107 4.3.3 Phase III: Characterize the Power-Law Relationship . . . . . . . . . . . . . . 108 4.4 Experimental Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.4.1 Performance Studies of Algorithm PPL . . . . . . . . . . . . . . . . . . . . . 111 4.4.2 Application Study: False Positive or False Negative of Frequent Itemsets . . . 115 4.4.3 Application Study: Sufficient Sample Size for Frequent Itemsets . . . . . . . . 117 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 4.6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5 On Exploring Prospective Frequent Patterns in Transaction Databases 125 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 5.1.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 5.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 5.1.3 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 5.2 Prospective Frequent Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 5.2.1 Support-Deviation Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 5.2.2 Triggering Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 5.2.3 Exploring Prospective Frequent Patterns . . . . . . . . . . . . . . . . . . . . . 140 5.3 Framework PFP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 5.3.1 Algorithm Level Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 5.3.2 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 5.4 Experimental Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 5.4.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 5.4.2 Scalability Analysis on the Synthetic Dataset . . . . . . . . . . . . . . . . . . 151 5.4.3 Sensitivity Studies on Real-World Datasets . . . . . . . . . . . . . . . . . . . 152 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 6 Conclusions..................157
dc.language.iso	en
dc.title	實用性導向頻繁集勘測	zh_TW
dc.title	On Feasibility-Oriented Mining of Frequent Patterns	en
dc.type	Thesis
dc.date.schoolyear	94-2
dc.description.degree	博士
dc.contributor.oralexamcommittee	李素瑛(Suh-Yin Lee),李強(Chiang Lee),柯佳伶(Jia-Ling Koh),陳良弼(Arbee L.-P. Chen),陳孟彰(Meng-Chang Chen),曾新穆(Vincent S. M. Tseng),歐陽彥正(Yen-Jen Oyang)
dc.subject.keyword	資料探勘,頻繁集,資料取樣,記憶體,	zh_TW
dc.subject.keyword	Data Mining,Frequent Patterns,Data Sampling,Memory,	en
dc.relation.page	165
dc.rights.note	有償授權
dc.date.accepted	2006-07-21
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	電信工程學研究所	zh_TW
顯示於系所單位：	電信工程學研究所

文件中的檔案：

檔案	大小	格式
ntu-95-1.pdf 目前未授權公開取用	3.12 MB	Adobe PDF

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。