Skip navigation

DSpace

機構典藏 DSpace 系統致力於保存各式數位資料(如:文字、圖片、PDF)並使其易於取用。

點此認識 DSpace
DSpace logo
English
中文
  • 瀏覽論文
    • 校院系所
    • 出版年
    • 作者
    • 標題
    • 關鍵字
    • 指導教授
  • 搜尋 TDR
  • 授權 Q&A
    • 我的頁面
    • 接受 E-mail 通知
    • 編輯個人資料
  1. NTU Theses and Dissertations Repository
  2. 電機資訊學院
  3. 電機工程學系
請用此 Handle URI 來引用此文件: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/27140
完整後設資料紀錄
DC 欄位值語言
dc.contributor.advisor陳銘憲(Ming-Syan Chen)
dc.contributor.authorChih-Ming Hsuen
dc.contributor.author許志銘zh_TW
dc.date.accessioned2021-06-12T17:56:16Z-
dc.date.available2010-02-18
dc.date.copyright2008-02-18
dc.date.issued2008
dc.date.submitted2008-01-31
dc.identifier.citation[1] N. R. Adam and J. C. Worthmann. Security-control methods for statistical databases: a
comparative study. ACM Computing Surveys, 21(4):515{556, 1989.
[2] C. C. Aggarwal. Re-designing distance functions and distance-based applications for high
dimensional data. SIGMOD Record, 30:13{18, 2001.
[3] C. C. Aggarwal. Towards systematic design of distance functions for data mining applications.
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 9{18, New York, NY, USA, 2003. ACM Press.
[4] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance
metrics in high dimensional spaces. In Proceedings of the The 8th International Conference
on Database Theory, volume 1973 of Lecture Notes in Computer Science, pages 420{434.
Springer, 2001.
[5] C. C. Aggarwal and P. S. Yu. The igrid index: reversing the dimensionality curse for
similarity indexing in high dimensional space. In Proceedings of the Sixth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 119{129, New
York, NY, USA, 2000. ACM Press.
[6] R. Agrawal, C. Faloutsos, and A. Swami. E cient similarity search in sequence databases.
In Proceedings of the 4th International Conference on Foundations of Data Organization
and Algorithm, pages 69{84, 1993.
[7] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of
high dimensional data for data mining applications. SIGMOD Rec., 27(2):94{105, 1998.
[8] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000
ACM SIGMOD international conference on Management of data, pages 439{450, New
York, NY, USA, 2000. ACM Press.
[9] Albert, H. Jeong, and A.-L. Barabasi. Diameter of the world wide web. Nature, 401:130{
131, 1999.
[10] K. P. Bennett, U. Fayyad, and D. Geiger. Density-based indexing for approximate nearestneighbor
queries. In Proceedings of the Fifth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 233{243, New York, NY, USA, 1999. ACM
Press.
[11] A. Berman and R. J. Plemmons. Nonnegative Matrices in the Mathematical sciences.
SIAM, 1994.
[12] D. Berndt and J. Cli ord. Using dynamic time warping to nd patterns in time series. In
AAAI workshop on Knowledge Discovery in Databases, pages 229{248, Seattle, Washington,
1994.
[13] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbors meaningful?
In Proceedings of the 7th International Conference on Database Theory, volume
1540 of Lecture Notes in Computer Science, pages 217{235. Springer, 1999.
[14] M. Bianchini, M. Gori, and F. Scarselli. Inside pagerank. ACM Transactions on Internet
Technology, 5(1):92{128, 2005.
[15] P. J. Bickel and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics,
volume 1. Prentice Hall, 2 edition, 2001.
[16] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[17] P. Boldi, M. Santini, and S. Vigna. Pagerank as a function of the damping factor. In
Proceedings of the 14th international conference on World Wide Web, pages 557{566, New
York, NY, USA, 2005. ACM Press.
[18] B. Bollob as, G. Das, D. Gunopulos, and H. Mannila. Time-series similarity problems and
well-separated geometric sets. Nordic J. of Computing, 8(4):409{423, 2001.
[19] R. Boscolo, H. Pan, and V. P. Roychowdhury. Independent component analysis based on
nonparametric density estimation. IEEE Transactions on Neural Networks, 15(1):55{65,
2004.
[20] B. L. Bowerman and R. T. O'Connell. Forecasting and Time Series: An Applied Approach.
Duxbury Press, California, USA, 3 edition, 1993.
[21] G. E. P. Box, G. M. Jenkins, and C. G. Reinsel. Time Series Analysis: Forecasting and
Control. Printice Hall, 3 edition, 1994.
[22] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer-Verlag, New
York, 2 edition, 1991.
[23] A. Bulut and A. K. Singh. Swat: Hierarchical stream summarization in large networks.
In Proceedings of the 21th International conference on Data Engineering, pages 303{314,
Bangalore, India, 2003. IEEE Computer Society.
[24] A. Bulut and A. K. Singh. A uni ed framework for monitoring data streams in real time. In
Proceedings of the 21st International conference on Data Engineering, pages 44{55, Tokyo,
Japan, 2005. IEEE Computer Society.
[25] K.-P. Chan and A. W.-C. Fu. E cient time series matching by wavelets. In Proceedings
of the 5th Int'l conference on Data Engineering, pages 126{133, Sydney, Austrialia, 1999.
IEEE Computer Society.
[26] K. Chen and L. Liu. Privacy preserving data classi cation with rotation perturbation. In
ICDM, pages 589{592, 2005.
[27] L. Chen and R. T. Ng. On the marriage of lp-norms and edit distance. In Proceedings of
the Thirtieth International Conference on Very Large Data Bases, pages 792{803. Morgan
Kaufmann, 2004.
[28] L. Chen, M. T. �ozsu, and V. Oria. Robust and fast similarity search for moving object
trajectories. In Proceedings of the 2005 ACM SIGMOD international conference on Man-
agement of data, pages 491{502, 2005.
[29] U. Cherubini, E. Luciano, and W. Vecchiato. Copula Methods in Fiance. Wiley, 2004.
[30] C. K. Chui. An Introduction to Wavelets. Academic Press, 1992.
[31] K. L. Chung. A Course in Probability Theory. Academic Press, 3 edition, 2001.
[32] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms.
The MIT Press, 2 edition, 2001.
[33] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
[34] G. Crises. Synthetic microdata generation for database privacy protection. Technical
Report CRIREP-04-2004, The CRISES research group, 2004.
[35] R. A. Dandekar, M. Cohen, and N. Kirkendall. Sensitive micro data protection using latin
hypercube sampling technique. In Inference Control in Statistical Databases, From Theory
to Practice, pages 117{125, London, UK, 2002. Springer-Verlag.
[36] R. A. Dandekar, J. Domingo-Ferrer, and F. Seb e. Lhs-based hybrid microdata vs rank
swapping and microaggregation for numeric microdata protection. In Inference Control in
Statistical Databases, From Theory to Practice.
[37] H. A. David and H. N. Nagaraja. Order Statistics. Wiley & Sons, 3 edition, 2003.
[38] A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application. Cambridge
University Press, 1997.
[39] L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, 1986.
[40] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, 2001.
[41] C. B. D.J. Newman, S. Hettich and C. Merz. UCI repository of machine learning databases,
1998.
[42] J. Domingo-Ferrer and V. Torra. A quantitative comparison of disclosure control methods
for microdata. In con dentiality, Disclosure and Data Access: Theory and Practical
Applications for Statistical Agencies, pages 111{134. North-Holland, 2001.
[43] R. Duda, P. Hart, and D. Stork. Pattern Classi cation. John Wiley & Sons, 2 edition,
2000.
[44] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[45] L. Ert�oz, M. Steinbach, and V. Kumar. Finding clusters of di erent sizes, shapes, and
densities in noisy, high dimensional data. In Proceedings of the Third SIAM International
Conference on Data Mining. SIAM, May 2003.
[46] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering
clusters in large spatial databases with noise. In Proceedings 2nd Intern'l Conference
Knowledge Discovery and Data Mining, pages 226{231, 1996.
[47] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in timeseries
databases. In Proceedings of the ACM SIGMOD International Conference on Man-
agement of Data Conference, pages 419{429. ACM, 1994.
[48] J. B. Fraleigh. A First Course in Abstract Algebra. Addison Wesley, 6 edition, 1999.
[49] D. Francois, M.-V. Wertz, and S. M.-M. Verleysen. The concentration of fractional distances.
IEEE Transactions on Knowledge and Data Engineering, 19(7), 2007.
[50] A. W.-C. Fu, E. Keogh, L. Y. H. Lau, and C. A. Ratanamahatana. Scaling and time
warping in time series querying. In Proceedings of the 31st international conference on
Very large data bases, pages 649{660. VLDB Endowment, 2005.
[51] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani. Mining the stock market: which
measure is best? In Proceedings of the 6th ACM International Conference on Knowledge
Discovery and Data Mining, pages 487{496. ACM, May 2000.
[52] X. Ge and P. Smyth. Deformable markov model templates for time-series pattern matching.
In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 81{90, New York, NY, USA, 2000. ACM Press.
[53] R. Gencay, F. Selcuk, and B. Whitcher. An Introduction to Wavelets and Other Filtering
Methods in Finance and Economics. Academic Press, 2001.
[54] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Sur ng wavelets on streams:
One-pass summaries for approximate aggregate queries. In Proceedings of the 27th Inter-
national Conference on Very Large Databases, pages 79{88, Roma, Italy, 2001. Morgan
Kaufmann.
[55] D. Q. Goldin and P. C. Kotidis. On similarity queries for time-series data: Constraint
speci cation and implementation. In Proceedings Constraint Programming, pages 137{153,
Cassis, France, 1995. Springer.
[56] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings
of the ACM SIGMOD International Conference on Management of Data, pages 47{57,
Boston, Massachusetts, 1984. ACM.
[57] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann
Publishers Inc., 2000.
[58] J. Han, M. Kamber, and A. Tung. Geographic Data Mining and Knowledge Discovery.
Number 21. Taylor and Francis, 2001.
[59] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high
dimensional spaces? In Proceedings of the 26th International Conference on Very Large
Data Bases, pages 506{515, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers
Inc.
[60] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of
dimensionality in high-dimensional clustering. In Proceedings of the 25th International
Conference on Very Large Data Bases, pages 506{517, San Francisco, CA, USA, 1999.
Morgan Kaufmann Publishers Inc.
[61] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[62] C.-M. Hsu and M.-S. Chen. Subspace clustering of high dimensional spatial data with
noises. In Proceedings of the 8th Paci c-Asia Conference on Knowledge Discovery and
Data Mining, pages 31{40. Springer, 2004.
[63] C.-M. Hsu and M.-S. Chen. On the necessary and su cient conditions of a meaningful
distance function for high dimensional data space. In Proceedings of the 2006 Conference
on Data Mining. SIAM, 2006.
[64] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In
Proceedings of the 2005 ACM SIGMOD international conference on Management of data,
pages 37{48, New York, NY, USA, 2005. ACM Press.
[65] Y. Huhtala, J. Karkkainen, and H. Toivonen. Mining for similarities in aligned time series
using wavelets. In Data Mining and Knowledge Discovery: Theory, Tools, and Technology,
volume 3695 of SPIE Proceedings, pages 150{160, FL, USA, 1999. The International Society
for Optical Engineering.
[66] A. Hyv�arinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley &
Sons, 2001.
[67] R. L. Iman and W. J. Conover. A distribution-free approach to inducing rank correlation
among input variables. Communications in Statistics, 11(3):311{334, 1982.
[68] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput.
Surveys, 31(3):264{323, 1999.
[69] I. T. Jollife. Principle Component Analysis. Springer-Verlag, New York, 2 edition, 2002.
[70] K. Kalpakis, D. Gada, and V. Puttagunta. Distance measures for e ective clustering of
arima time-series. In Proceedings of the IEEE 2001 International Conference on Data
Mining, pages 273{280, California, USA, 2001. IEEE Computer Society.
[71] S. D. Kamvar, T. H. Haveliwala, and G. H. Golub. Adaptive methods for the computation
of pagerank. Linear Algebra and its Applications, 386:51{65, 2004.
[72] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploiting the block
structure of the web for computing pagerank. Technical Report 2003-17, Stanford University,
2003.
[73] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrapolation methods
for accelerating pagerank computations. In Proceedings of the 12th international conference
on World Wide Web, pages 261{270, New York, NY, USA, 2003. ACM.
[74] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties
of random data perturbation techniques. In Proceedings of the Third IEEE International
Conference on Data Mining, pages 19{22, Washington, DC, USA, 2003. IEEE Computer
Society.
[75] N. Katayama and S. Satoh. Distinctiveness-sensitive nearest neighbor search for e cient
similarity retrieval of multimedia information. In Proceedings of the 17th International Con-
ference on Data Engineering, pages 493{502, Washington, DC, USA, 2001. IEEE Computer
Society.
[76] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley and Sons, 19909.
[77] E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 206{215, New York, NY, USA, 2004. ACM Press.
[78] E. J. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey
and empirical demonstration. Data Mining Knowledge Discovery, 7(4):349{371, 2003.
[79] S. W. Kim, S. Park, and W. W. Chu. E cient processing of similarity search under time
warping in sequence databases: An index-based approach. Information System, 29(5):405{
420, 2004.
[80] M. Kontaki and A. Papadopoulos. E cient similarity search in streaming time sequences.
In Proceedings of the 16th IEEE Conference on Scienti c and Statistical Database Man-
agement, pages 63{72, Santorini Island, Greece, 2004. IEEE Computer Society.
[81] N. Kumar, V. N. Lolla, E. J. Keogh, S. Lonardi, and C. A. Ratanamahatana. Time-series
bitmaps: a practical visualization tool for working with large time series databases. In
SDM, 2005.
[82] A. N. Langville and C. D. Meyer. Deeper inside pagerank. Internet Mathematics, 1(3):335{
400, 2004.
[83] A. N. Langville and C. D. Meyer. Google's PageRank and Beyond: The Science of Search
Engine Rankings. Princeton University Press, 2006.
[84] A. N. Langville and C. D. Meyer. A reordering for the pagerank problem. SIAM J.
Scienti c and Statistical Computing, 27(6):2112{2120, 2006.
[85] M. Ledoux. The Concentration of Measure Phenomenon. Amer Mathematical Society,
2001.
[86] C. P. Lee, G. H. Golub, and S. A. Zenios. A fast two-stage algorithm for computing
pagerank. Technical Report SCCM-2003-15, Stanford University, 2003.
[87] T. Li, Q. Li, S. Zhu, and M. Ogihara. A survey on wavelet applications in data mining.
SIGKDD Explorations, 4(2):49{68, 2002.
[88] C. K. Liew, U. J. Choi, and C. J. Liew. A data distortion by probability distribution. ACM
Transactions Database Systems, 10(3):395{411, 1985.
[89] B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer,
2007.
[90] S. G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674{693,
1989.
[91] J. M. Mateo-Sanz, A. Mart nez-Ballest e, and J. Domingo-Ferrer. Fast generation of accurate
synthetic microdata. In Privacy in Statistical Databases, pages 298{306, 2004.
[92] M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for
selecting values of input variables in the analysis of output from a computer code. Tech-
nometrics, 21(2):239{245, 1979.
[93] C. D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2000.
[94] V. D. Milman and G. Schechtman. Asymptotic theory of nite dimensional normed spaces.
Springer-Verlag, New York, NY, USA, 1986.
[95] C. Moler. The world s largest matrix computation. MATLAB News and Notes, pages
12{13, Octor 2002.
[96] K. Muralidhar and R. Sarathy. Security of random data perturbation methods. ACM
Transactions Database Systems, 24(4):487{493, 1999.
[97] K. Muralidhar and R. Sarathy. Data shu ing - a new masking approach for numerical
data. Management Science, 52(5):658{670, 2006.
[98] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid le: An adaptable, symmetric
multikey le structure. ACM Trans. Database Syst., 9(1):38{71, 1984.
[99] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing
order to the web. Technical Report 2003-17, Stanford University, 1998.
[100] D. B. Percival and A. T. Walden. Spectral Analysis for Physical Application. Cambridge
University Press, 1993.
[101] D. B. Percival and A. T. Walden. Wavelet Methods for Time Series Analysis. Cambridge
University Press, 2000.
[102] V. Pestov. On the geometry of similarity search: dimensionality curse and concentration
of measure. Information Processing Letters, 73(1-2):47{51, 2000.
[103] I. Popivanov and J. Miller. Similarity search over time-series data using wavelets. In
Proceedings of the 18th International Conference on Data Engineering, pages 212{221,
Washington, DC, USA, 2002. IEEE Computer Society.
[104] T. R. C. Read and N. A. C. Cressie. Goodness-of-Fit Statistics for Discrete Multivariate
Data. Spring-Verlag, 1988.
[105] V. K. Rohatgi and A. K. M. E. Saleh. Introduction to Probability and Statistics. Wiley, 2
edition, 2001.
[106] S. Ross. A First Course in Probability. Prentice Hall, 2002.
[107] G. G. Roussas. A Course in Mathematical Statistics. Academic Press, 2 edition, 1997.
[108] E. Schikuta. Grid-clustering: A fast hierarchical clustering method for very large data
sets. In Proceedings 13th Int. Conf. On Pattern Recognition, volume 2, pages 101{105.
IEEE Computer Society, 1996.
[109] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John
Wiley, New York, 1992.
[110] E. Seneta. Non-negative Matrices and Markov Chains. Springer, 2006.
[111] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A multi-resolution clustering
approach for very large spatial databases. In Proceedings of the 24rd International Con-
ference on Very Large Data Bases, pages 428{439, San Francisco, CA, USA, 1998. Morgan
Kaufmann Publishers Inc.
[112] J. C. Sprott. Chaos and Time-Series Analysis. Oxford University Press, 2003.
[113] Z. R. Struzik and A. Siebes. The haar wavelet transform in the time series similarity
paradigm. In Proceedings of the Third European Conference on Principles of Data Mining
and Knowledge Discovery, pages 12{22, London, UK, 1999. Springer-Verlag.
[114] A. C. Tamhane and D. D. Dunlop. Statistics and Data Analysis: from Elementary to
Intermediate. Prentice Hall, 2000.
[115] P. Tendick and N. Matlo . A modi ed random perturbation method for database security.
ACM Transactions Database Systems, 19(1):47{63, 1994.
[116] J. A. Tomlin. A new paradigm for ranking pages on the world wide web. In Proceedings
of the 12th international conference on World Wide Web, pages 350{355, New York, NY,
USA, 2003. ACM Press.
[117] R. S. Tsay. Analysis of Financial Time Series. Wiley, 2002.
[118] R. S. Varga. Matrix Iterative Analysis. Springer, 2 edition, 2000.
[119] B. Vidakovic. Statistical Modeling by Wavelets. Wiley Inter-Science, 1999.
[120] M. Vlachos, D. Gunopoulos, and G. Kollios. Discovering similar multidimensional trajectories.
In Proceedings of the 18th International Conference on Data Engineering, pages
673{684, Washington, DC, USA, 2002. IEEE Computer Society.
[121] M. Vlachos, P. S. Yu, and V. Castelli. On periodicity detection and structural periodic
similarity. California, USA, 2005. SIAM.
[122] C. Wang and X. S. Wang. Supporting content-based searches on time series via approximation.
In Proceedings of the 12th International Conference on Scienti c and Statistical
Database Management, pages 69{81, 2000.
[123] W. Wang, J. Yang, and R. R. Muntz. Sting: A statistical information grid approach to
spatial data mining. In Proceedings of the 23rd International Conference on Very Large
Data Bases, pages 186{195, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers
Inc.
[124] Y.-L. Wu, D. Agrawal, and A. E. Abbadi. A comparison of dft and dwt based similarity
search in time-series databases. In Proceedings of the 9th international conference on
Information and knowledge management, 2000.
[125] B.-K. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary lp norms. In Proceed-
ings of 26th International Conference on Very Large Data Bases, pages 385{394. Morgan
Kaufmann, 2000.
[126] O. R. Zaiane, A. Foss, C.-H. Lee, and W. Wang. On data clustering analysis: Scalability,
constraints, and validation. In Proceedings of the 6th Paci c-Asia Conference on Knowledge
Discovery and Data Mining, pages 28{39. Springer, 2002.
[127] H. Zhou and D. Woodru . Clustering via matrix powering. In Proceedings of the Twenty-
third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems,
pages 136{142, New York, NY, USA, 2004. ACM Press.
dc.identifier.urihttp://tdr.lib.ntu.edu.tw/jspui/handle/123456789/27140-
dc.description.abstract隨著電腦與網際網路的使用普及後,加上資料不斷被數位儲存,致使高維度並包含大量資料的資料庫的建置與應用技術需求日趨迫切,如何有效處理高維度並包含大量資料的資料庫技術目前仍為資料勘測上極富挑戰之重要研究議題,在本論文中我們將提出幾個關於高維度資料之處理技術與量測方法。
首先我們探討高維度資料空間下距離函數非穩定現象之理論性質,並提出幾個函數非穩定現象之充分與必要條件,明確地說,我們驗證得到:高維度下,(1)資料點的距離分布之變異係數會退化為零;(2)任兩資料點與查詢點間的距離比會機率收斂到1;以及(3)資料點的距離分布之第二階動差係數會收斂到1等現象,皆為函數非穩定現象之充分與必要條件,易言之,造成距離函數非穩定現象之主要原因為資料點的距離分布之變異過小。依據此充要條件,針對面臨非穩定現象困境之距離函數我們進一步提出重設計機制,實驗證實,加大資料點的距離分布變異確實有抑制非穩定現象之效果。
此外,目前常用時間數列相似度之量測函數,往往僅考量兩數列間的走勢圖型之相似情形,數列本身之統計特質和次序關係並未被整合於相似度之量測函數內,因此我們利用時間數列之小波功率譜密度函數來整合數列之統計特質,並設計一個時間數列之相似度量測函數,稱為WPSM,實驗證實,WPSM於時間數列分群方面其精確性確實較其他常用的量測方法高,且具較低之時間複雜度,並有漲縮平移之不變性,同時可應用於不同長度之數列間。
關於多維度數值資料之隱私保護問題,我們應用獨立成分分析技巧提出以模擬為基礎之技術框架,稱PRIMP,模擬為基礎之隱私保護技術,主要應用某個模擬方法產生和原始資料相似統計特質之虛擬資料來達個人隱私保護目的,目前幾個著名的技術,其產生的虛擬資料僅能維持原始資料之某特定統計特質,如應用Cholesky分解技巧來保持變異矩陣,或應用LHS技巧來保持次序變異矩陣等統計量,不過如此將會限縮公告資料的資料可用性,相較於先前以模擬為基礎之研究,PRIMP於實例實驗上能保持更多資料的統計特質,理論上亦具較低的計算複雜度,以及非常低的資料洩漏風險;更進一步,我們另外整合PRIMP與Cholesky分解技巧,來修正PRIMP所產生的虛擬資料之變異矩陣,使其能更精確保持原始資料之變異矩陣。
PageRank是現今網路瀏覽器上非常重要的網頁排序技術,面對網際網路上數以億計之網頁,如何簡化PageRank的計算就顯得非常迫切,不過現存網頁雖多,整體來看網頁間的連結關係實際上相當稀疏,實際上許多網頁並未被其他網頁所連結,另外,有許多網頁並未連結到其他網頁,應用這些網頁間的連結結構,我們設計一個雙邊重排之PageRank演算法,相較於傳統的PageRank計算方式(即所謂Power Method)和Langville等人提出的單邊重排之PageRank演算法,實例驗證雙邊重排之PageRank演算法確可顯著縮短PageRank的計算時間。
最後,我們提出一個關於空間資料之子空間分群法,稱SCI,SCI整合網格技術和資料分布密度特質,改進以往應用網格技術會將某些有意義的群不當切割且所得的群的邊界會平行於座標軸等現象,由於SCI僅需掃描資料一次,其計算複雜度為資料數的線性函數,且有漲縮平移之不變性,與資料輸入次序無關等性質。
綜而言之,於高維度資料空間上,針對資料勘測上的幾個重要問題,我們提出處理技術與量測方法。量測方面,對於一般距離函數的非穩定現象深入瞭解其理論特質,並提出個避免非穩定現象之距離函數的重設計機制,另外,對於時間數列資料,設計一個能整合並比較數列本身統計特質之時間數列相似度量測函數;資料處理部分,關於多維度數值資料之隱私保護問題,我們提出兩個模擬為基礎之技術框架,能保持更多原始資料統計性質並具非常低的個人隱私洩漏風險,此外,我們設計一個相當有效率的雙邊重排之PageRank演算法,最後,我們提出一個有效能的空間資料之子空間分群法。
zh_TW
dc.description.abstractThe importance of discovering knowledge from a huge and high dimensional database is growing at rapid pace in recent years. This is the reason why data mining has attracted a significant amount of research attention. Recently, the curse of dimensionality has been studied extensively in several data mining problems, such as clustering, nearest neighbor
search, indexing, privacy preserving, and similarity measure of time series, because it is critical to both the performance and quality of data mining applications. In this dissertation, we explore several techniques to breaking the curse of dimensionality for some data mining problems in high dimensional space. Specifically, we first derive the the sufficient and necessary conditions of a stable (or meaningful) distance function in high dimensional data space. Further, we propose a new efficient and effective similarity measure for time series. Also, we propose two frameworks for simulation-based privacy preservation of multivariate numerical data. In addition, we propose a new reordered PageRank algorithm to speedup the computation of the PageRank vector. Finally, we develop a algorithm SCI (Subspace clustering based on Information) which integrates Shannon information and grid-based and density-based clustering to form a good solution for subspace clustering of high dimensional spatial data with noises.
Recent research results have shown that, in high dimensional space, the concept of distance (proximity) may not even be qualitative. Explicitly, Beyer et al. showed that if the Pearson variation of the distance distribution converges to zero with increasing dimensionality, the distance function will become unstable (or meaningless) in high dimensional space even with the commonly used $L_p$ metric on the Euclidean space. With regard to quality, the design of effective distance functions has been deemed a very important and challenging issue. However, the necessary condition for
unstability (i.e., the negation of the sufficient condition for the stability) of a distance function, which is required for function design, remains unknown. In this dissertation, we prove that the following conditions are equivalent to unstability: (1) the Pearson variation of the distance distribution for any given query converges to 0 with increasing dimensionality; (2) the ratio of the distances of any two points to the query converges in probability to 1 with increasing dimensionality; and (3) the 2nd moment coefficient converges to 1 with increasing dimensionality.
With these results, we have the necessary and the sufficient
conditions for unstability, whose negatives imply the sufficient and necessary conditions for stability.
Most of well known approaches adopt the Euclidean distance
or its extensions to measure similarity between time series. As mentioned above, under a broad set of conditions in high
dimensional space, the Euclidean distance function will become unstable (or meaningless). In addition, the intra-structures (or statistical characteristics) of time series, such as seasonal components, nonstationary behavior, long-term memory, and heterogeneous variations, do not be integrated into the Euclidean distance. Consequently, any similarity measurement of time series that only focuses on inter-series shapes always encounters some disadvantages, such as false alarms, normalization is required in advance, the curse of dimensionality, and the lengths of series must be the same. The wavelet power spectrum can utilize the statistical characteristics of a time series. In addition, it is easy to implement and very effective because its run time is linear to the length of series. Therefore, we propose a new similarity measure of time series, called the Wavelet Power Spectrum Measure (WPSM), based on a wavelet power spectrum. This approach has several advantages, including invariance under scale and shift transformation, no normalization during pre-processing, and applicability to series of different length. Furthermore, we can integrate WPSM with the wavelet power spectra of series to build an effective index with low dimensionality.
Privacy preservation is a fundamental issue in data mining
and knowledge discovery. Because the internal relationships of attributes are always complex and polymorphous, it is difficult to generate a synthetic data set that preserves the statistical characteristics of some given data set without information about the data's distribution. For the problem of simulation-based privacy preservation for multivariate numerical data, we propose two frameworks for generating a synthetic data set with the constraint that the probability distribution is as close as possible to that of the original set. The first framework, called PRIMP (PRivacy preserving by Independent coMPonents), is based on independent component analysis (ICA). In addition, we prove that the synthetic data generated by PRIMP is sufficiently different from the original data; thus, we are able to protect confidential
information in the original data. Also, PRIMP is easy to implement and very effective because, by using FastICA, its run time is linear to the size of the input. The second approach proposed is a hybrid method that combines PRIMP and Cholesky's decomposition technique. In theory, the hybrid method preserves the covariance matrix of the original data exactly. The method also resolves the problem of generating good seeds for the Cholesky-based approach.
PageRank is one of the most important ranking techniques
used in today's search engines. Since billions of pages already existed in Web, computing the PageRank vector is very
time-consuming procedure. How to accelerate the computation of the PageRank vector has turned out to be an important issue. In view of this, we propose a new reordered PageRank algorithm, called two-way reordered PageRank algorithm, which exploits both reverse-dangling nodes and dangling nodes to speedup the computation of the PageRank vector.
Clustering a large amount of high dimensional spatial data
sets with noises is an important challenge in data mining. In this dissertation, we present a new subspace clustering method, called SCI (Subspace Clustering based on Information), to solve this problem. The SCI combines Shannon information with grid-based and density-based clustering techniques. The design of clustering algorithms is equivalent to construct an equivalent relationship among data points. Therefore, we propose an equivalent relationship, named density-connected, to identify the main bodies of clusters. For the purpose of noise detection and cluster boundary discovery, we also use the grid approach to devise a new
cohesive mechanism to merge data points of borders into clusters and to filter out the noises. However, the curse of dimensionality is a well-known serious problem of using grid approach on high dimensional data sets because the number of the grid cells grows exponentially in dimensions. To strike a compromise between the randomness and the structure, we propose an automatic method for attribute selection based on the Shannon information. With the merit of only requiring one data scan, algorithm SCI is very efficient with its run time being linear to the size of the input data set. In addition, SCI has several advantages, including being scalable to the number of attributes, robust to noises, invariant under the location-scale transformation, and independent of the order of input.
en
dc.description.provenanceMade available in DSpace on 2021-06-12T17:56:16Z (GMT). No. of bitstreams: 1
ntu-97-F90921070-1.pdf: 6706364 bytes, checksum: 17f7968d3c27f335504f4d5b55250a05 (MD5)
Previous issue date: 2008
en
dc.description.tableofcontentsContents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 On the Design of Distance Functions in High Dimensional Data Space 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 On a Meaningful Distance Function . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Related Theorems of Probability . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 Theoretical Properties of Unstability . . . . . . . . . . . . . . . . . . . . . 13
2.4 Indices for Testing Meaningful Functions . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 Indices Based on the Extremal Ratio . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Indices Based on an Arbitrary Ratio . . . . . . . . . . . . . . . . . . . . . 21
2.5 Shrinkage-Divergence Proximity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1 Denition of SDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.2 Properties of SDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6.1 Meaningful Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.2 Evaluation Based on the Extremal Ratio . . . . . . . . . . . . . . . . . . . 28
2.6.3 Evaluation Based on the Arbitrary Ratio . . . . . . . . . . . . . . . . . . . 29
2.6.4 Clustering Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Similarity Measurement Between Time Series Based on Wavelet Power Spec-
tra 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Wavelet Power Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 Discrete Wavelet Transformation . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 Wavelet Coecients for Scale and Shift-transformed Series . . . . . . . . . 46
3.3.3 Wavelet Power Spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Similarity Measure based on The Wavelet Power Spectrum . . . . . . . . . . . . . 51
3.4.1 Wavelet Power Spectrum Measure (WPSM) . . . . . . . . . . . . . . . . . 51
3.4.2 Properties of WPSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.1 The Discriminability of Models . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5.2 Property Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4 Simulation-based Approaches for Privacy Preservation 66
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Privacy Preservation with Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Simulation-Based Privacy Preservation . . . . . . . . . . . . . . . . . . . . 73
4.3.2 Independent Component Analysis . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.3 ICA-based Approach for Privacy Preservation . . . . . . . . . . . . . . . . 78
4.3.4 Properties of PRIMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 A Hybrid PRIMP and Cholesky-based Approach for Privacy Preservation . . . . . 88
4.5 Experiment Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5 Two-Way Reordered PageRank Algorithm 100
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.1 Web model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.2 Google's PageRank model . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.3 Linear System for PageRank Problem . . . . . . . . . . . . . . . . . . . . . 104
5.4 Reordered PageRank Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4.1 Reverse-Dangling Nodes and Dangling Nodes . . . . . . . . . . . . . . . . 105
5.4.2 Two-Way Reordered PageRank Algorithm . . . . . . . . . . . . . . . . . . 108
5.4.3 Other reordering methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4.4 Remarks on Solving the linear subsystem . . . . . . . . . . . . . . . . . . . 114
5.5 Experiment Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6 Subspace Clustering of High Dimensional Large Spatial Data 117
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.3 Subspace Clustering based on Information . . . . . . . . . . . . . . . . . . . . . . 121
6.3.1 Denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3.2 Algorithm SCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.3.3 Properties of Algorithm SCI . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7 Conclusions 132
Bibliography 135
dc.language.isoen
dc.subject距離函數zh_TW
dc.subject高維度魔咒zh_TW
dc.subject小波功率譜函數zh_TW
dc.subject隱私保護zh_TW
dc.subjectGoogle 的PageRankzh_TW
dc.subject資料分群法zh_TW
dc.title高維度資料之處理與量測zh_TW
dc.titleOn the Processing and Measurement of High Dimensional Dataen
dc.typeThesis
dc.date.schoolyear96-1
dc.description.degree博士
dc.contributor.oralexamcommittee陳良弼(Arbee L.P. Chen),蔣以仁(I-Jen Chiang),李強(Chiang Lee),呂永和(Yungho Leu),曾新穆(Vincent S.M. Tseng)
dc.subject.keyword高維度魔咒,距離函數,小波功率譜函數,隱私保護,Google 的PageRank,資料分群法,zh_TW
dc.subject.keywordCurse of dimensionality,Distance function,Wavelet power spectrum,Privacy preservation,Data clustering,en
dc.relation.page145
dc.rights.note有償授權
dc.date.accepted2008-02-01
dc.contributor.author-college電機資訊學院zh_TW
dc.contributor.author-dept電機工程學研究所zh_TW
顯示於系所單位:電機工程學系

文件中的檔案:
檔案 大小格式 
ntu-97-1.pdf
  未授權公開取用
6.55 MBAdobe PDF
顯示文件簡單紀錄


系統中的文件,除了特別指名其著作權條款之外,均受到著作權保護,並且保留所有的權利。

社群連結
聯絡資訊
10617臺北市大安區羅斯福路四段1號
No.1 Sec.4, Roosevelt Rd., Taipei, Taiwan, R.O.C. 106
Tel: (02)33662353
Email: ntuetds@ntu.edu.tw
意見箱
相關連結
館藏目錄
國內圖書館整合查詢 MetaCat
臺大學術典藏 NTU Scholars
臺大圖書館數位典藏館
本站聲明
© NTU Library All Rights Reserved