A Study of Data Citation

Yi-Hung Huang; 黃曳弘

Please use this Handle URI to cite this document: http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4387

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	林軒田
dc.contributor.author	Yi-Hung Huang	en
dc.contributor.author	黃曳弘	zh_TW
dc.date.accessioned	2021-05-14T17:42:00Z	-
dc.date.available	2016-02-15
dc.date.available	2021-05-14T17:42:00Z	-
dc.date.copyright	2016-02-15
dc.date.issued	2015
dc.date.submitted	2016-01-27
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/4387	-
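dc.description.abstract	In this thesis, we focus on analyzing various studies of data citation for data repositories. We believe that a consistent practice of data citation will help promote data sharing and improve data reusability, because it can be regarded as analogous to the citation patterns of journals and other publications and is recognized by users in the relevant fields. The Protein Data Bank (PDB) is a database dedicated to storing three-dimensional structural data of proteins and nucleic acids, most of which play key roles in biological mechanisms. These data are obtained mainly by structural biologists around the world through X-ray crystallography or NMR spectroscopy experiments. Major scientific journals require scientists to submit their results to the PDB, where they are deposited under unique identifiers (PDB IDs) for free public use, making the PDB an important resource for structural biology research. The PDB is therefore a good subject for data citation research. Our study considers the interplay between mentions of PDB IDs in the body text and citations in the reference list, and expresses the relative importance of these two citation mechanisms by studying the data citation patterns. By exploring these rich protein structure data and the related citations, we can study the relationships between protein structures from the viewpoint of the citation network. In addition, analysis of the literature and data citation networks can reveal potential pathways of scientific development, that is, the process by which knowledge and data are used to advance structural biology. Based on the results of these analyses, we can propose appropriate data citation practices for linking citations and data, as well as ways of measuring data usage. This will benefit data reuse, support the reproducibility of experiments, and even provide machine-readable tracking of data usage.	zh_TW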
dc.description.abstract	In this thesis, we analyze various forms of data citation to data repositories. We argue that a consistent practice of data citation facilitates and incentivizes data sharing and reuse, because citations to data can count as professional recognition for data providers in the same way as citations to journal articles and other types of publications. The Protein Data Bank (PDB) is the worldwide repository of 3D structures of proteins, nucleic acids, and complex assemblies, most of which play essential biological roles. The PDB consists mainly of experimentally determined protein structures, each provided with a unique identifier (PDB ID) and a corresponding primary citation, which makes the entries easy to reference as data. The PDB is therefore a good model for data citation research. Our studies focus on the interplay between mentions of PDB IDs in the text and references cited in the literature; the relative importance of these two mechanisms can be expressed by investigating the data citation patterns. By exploring the rich structures in the PDB and their related citations, we can investigate the relationships between protein structures from the viewpoint of the citation network. Moreover, analysis of the literature and data citation networks may reveal potential pathways of scientific discovery, that is, how knowledge and data were used to advance a particular field in structural biology. Based on the results of these analyses, we can recommend data citation and provenance practices, approaches to discovering data citations, ways of linking citations and data, and data access metrics. We hope our work will benefit data reuse and the reproducibility of experiments, and even provide machine readability for tracing data usage.	en
dc.description.provenance	Made available in DSpace on 2021-05-14T17:42:00Z (GMT). No. of bitstreams: 1 ntu-104-D98922025-1.pdf: 1339915 bytes, checksum: bcaef9a1bc30c3afab5ac96754f4dcf5 (MD5) Previous issue date: 2015	en
dc.description.tableofcontents	Oral Examination Committee Certification; Acknowledgements; Abstract (Chinese); Abstract; 1 Introduction; 1.1 Motivation and Overview of the Thesis; 1.1.1 Data Citation; 1.1.2 RCSB Protein Data Bank and Related Repository; 1.1.3 Transformative Research; 1.2 Organization of the Thesis; 2 Identifying Transformative Scientific Research; 2.1 Introduction; 2.2 Related Work; 2.3 Materials and Methods; 2.3.1 Data; 2.3.2 Cascade; 2.3.3 Cascade Disruption; 2.3.4 Computing Cascade Disruption; 2.4 Evaluation; 2.4.1 Validity; 2.4.2 Reliability; 2.4.3 Scalability; 2.5 Results and Discussion; 2.5.1 Physics; 2.5.2 Computer Science; 2.6 Summary; 3 Citing the Protein Data Bank and Related Repository; 3.1 Introduction; 3.2 Related Work; 3.3 Materials and Methods; 3.3.1 Paper Citation Data; 3.3.2 Mining URL Mentions; 3.3.3 PDB Usage Statistics; 3.3.4 Calibrated Disruption Score; 3.4 Results and Discussion; 3.4.1 Paper Citations; 3.4.2 URL Mentions; 3.4.3 Data Usage Statistics; 3.5 Summary; 4 Data Citation to the Protein Data Bank; 4.1 Introduction; 4.2 Materials and Methods; 4.2.1 Citation data; 4.2.2 Mention data; 4.2.3 Mentions of issued PDB IDs; 4.2.4 G-test of Independence; 4.2.5 Pearson Correlation Coefficient; 4.2.6 Co-citations/mentions between PDB Entries; 4.2.7 Jaccard Index; 4.3 Results and Discussion; 4.3.1 User Tendency to the PDB Data Citation; 4.3.2 Trends of Protein Structure Researches; 4.3.3 Statistic Test to the Data Citation; 4.3.4 Analysis of the Co-citation/mention Patterns; 4.3.5 Identification of the Influential PDB Entries; 4.3.6 If the Authors Clearly Cite Data Sources Will Also Help Improve Impact of Their Own Papers?; 4.4 Summary; 5 Summaries and Future Work; 5.1 Summary of the results; 5.2 Limitations; 5.3 Future directions; Appendices; A Estimation of the Subsampling Error of Disruption Score; B Model-based Approximation for Cascade Generating Function; Bibliography
dc.language.iso	en
dc.title	資料引用之研究	zh_TW
dc.title	A Study of Data Citation	en
dc.type	Thesis
dc.date.schoolyear	104-1
dc.description.degree	Doctoral
dc.contributor.coadvisor	許鈞南
dc.contributor.oralexamcommittee	趙坤茂,陳倩瑜,莊庭瑞,劉家宏
dc.subject.keyword	Data Citation, Citation Network, Information Cascade, Protein Data Bank	zh_TW
dc.subject.keyword	Data Citation, Citation Network Analysis, Information Cascade, Protein Data Bank	en
dc.relation.page	85
dc.rights.note	Authorization granted (open access worldwide)
dc.date.accepted	2016-01-28
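dc.contributor.author-college	College of Electrical Engineering and Computer Science	zh_TW
dc.contributor.author-dept	Graduate Institute of Computer Science and Information Engineering	zh_TW
Appears in Collections:	Department of Computer Science and Information Engineering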

Files in This Item:

File	Size	Format
ntu-104-1.pdf	1.31 MB	Adobe PDF

Items in this system are protected by copyright, with all rights reserved, unless their copyright terms are otherwise indicated.