匿名統計推論的理論極限:異質性感測網路與群眾分包之隱私考量

Wei-Ning Chen; 陳偉寧

請用此 Handle URI 來引用此文件： http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71730

完整後設資料紀錄

DC 欄位	值	語言
dc.contributor.advisor	王奕翔(I-Hsiang Wang)
dc.contributor.author	Wei-Ning Chen	en
dc.contributor.author	陳偉寧	zh_TW
dc.date.accessioned	2021-06-17T06:07:56Z	-
dc.date.available	2019-01-07
dc.date.copyright	2019-01-07
dc.date.issued	2017
dc.date.submitted	2018-12-27
dc.identifier.citation	[1] A. E. Alaoui, A. Ramdas, F. Krzakala, L. Zdeborova, and M. I. Jordan. Decoding from pooled data: Sharp information-theoretic bounds. arXiv:1611.09981, 2016. [2] A. E. Alaoui, A. Ramdas, F. Krzakala, L. Zdeborova, and M. I. Jordan. Decoding from pooled data: Phase transitions of message passing. arXiv:1702.02279, 2017. [3] M. Aldridge, L. Baldassini, and O. Johnson. Group testing algorithms: bounds and simulations. IEEE Transactions on Information Theory, 60(6):3671–3687, 2014. [4] G. K. Atia and V. Saligrama. Boolean compressed sensing and noisy group testing. IEEE Transactions on Information Theory, 58(3):1880–1901, 2012. [5] N. H. Bshouty. Optimal algorithms for the coin weighing problem with a spring scale. The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 2009. [6] N. H. Bshouty. On the coin weighing problem with the presence of noise. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques - 15th International Workshop, 2012. [7] N. H. Bshouty and H. Mazzawi. Algorithms for the coin weighing problems with the presence of noise. Electronic Colloquium on Computational Complexity, 2011. [8] M.Bun, J.Ullman, and S.Vadhan. Fingerprinting codes and the price of approximate differential privacy. Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC), 2014 [9] W.-N. Chen, H.-C. Chen, and I.-H. Wang. On the fundamental limits of heterogeneously distributed detection: Price of anonymity. IEEE International Symposium on Information Theory (ISIT), June 2018. [10] W.-N. Chen and I.-H. Wang. Partial data extraction via noisy histogram queries: Information theoretic bounds. Proceedings of IEEE International Symposium on Information Theory, 2017. [11] W.-N. Chen and I.-H. Wang. Partial data extraction via noisy histogram queries: Information theoretic bounds. Manuscript, January 2017. http://homepage.ntu.edu.tw/%7Eihwang/Research/Eprints/isit17nq.pdf. [12] W.-N. Chen and I.-H. Wang. Anonymous heterogeneous distributed detection: Optimal decision rules, error exponents, and the price of anonymity. arXiv:1805.03554, 2018, https://arxiv.org/abs/1805.03554. [13] T. M. Cover and J. A. Thomas. Elements of Information Theory. Number 0471241954. Wiley-Interscience, 2006. [14] I. Csiszár. A simple proof of Sanov’s theorem. Bull. Braz. Math. Soc. (N.S.), 2006. [15] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications, volume Stochastic Modelling and Applied Probability of 38. Springer-Verlag, 2010. [16] F. den Hollander. Large Deviations. Number 14 in Fields Institute Monographs. American Mathematical Society, 2000. [17] I. Dinur and K. Nissim. Revealing information while preserving privacy. Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 202–210, 2003. [18] C. Dwork, F. McSherry, and K. Talwar. The price of privacy and the limits of lp decoding. Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC), pages 85–94, June 2007. [19] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Theoretical Computer Science, pages 211–407, 2013. [20] C. Dwork and S. Yekhanin. New efficient attacks on statistical disclosure control mechanism. In Proceedings of the Advances in Cryptology-CRYPTO, pages 469– 480, 2008. [21] C. George and R. L. Berger. Statistical inference. Duxbury, 2002. [22] W. Hoeffding. Asymptotically optimal tests for multinomial distributions. Annals of Mathematical Statistics, 36(2):369–401, 1965. [23] P. J. Huber. A robust version of the probability ratio test. Annals of Mathematical Statistics, 36(6):1753–1758, 1965. [24] P. J. Huber and V. Strassen. Minimax tests and the Neyman-Pearson lemma for capacities. Annals of Statistics, 1(2):251–263, 1973. [25] P. G. Ipeirotis, F. Provost, and J. Wang. Quality management on Amazon Mechanical Turk. Proceedings of the ACM SIGKDD Workshop on Human Computation, 2010. [26] B. Kailkhura, Y. S. Han, S. Brahma, and P. K. Varshney. Asymptotic analysis of distributed Bayesian detection with Byzantine data. IEEE Signal Processing Letters, 22, 2015. [27] T. Laarhoven. Asymptotics of fingerprinting and group testing: Tight bounds from channel capacities. IEEE Transactions on Information Forensics and Security, 10(9):1967–1980, 2015. [28] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. AMC Transactions on Programming Languages and Systems, 4, July 1982. [29] E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Number 978-0- 387-27605-2. Springer-Verlag New York, 2005. [30] S. Marano, V. Matta, and L. Tong. Distributed detection in the presence of Byzantine attacks. IEEE Transactions on Signal Processing, 57(1):16–29, January 2009. [31] J. Matoušek and J. Vondrák. The probabilistic method. Lecture Notes, March 2008. [32] Y. Polyanskiy and Y. Wu. Lecture notes on information theory. August 2017. [33] D. Robert. The detection of defective members of large populations. The Annals of Mathematical Statistics, 14(4):436–440, 1943. [34] H. Royden and P. Fitzpatrick. Real Analysis. Pearson, 2010. [35] J.ScarlettandV.Cevher.Phasetransitionsingrouptesting.In Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), 2016. [36] J. Scarlett and V. Cevher. Phase transitions in the pooled data problem. Advances in Neural Information Processing Systems 30 (NIPS 2017), 2017. [37] V. Y. Tan and G. K. Atia. Strong impossibility results for sparse signal processing. Signal Processing Letters, IEEE, 21(3):260–264, 2014. [38] R. R. Tenney and N. R. Sandell. Detection with distributed sensors. IEEE Transactions on Aerospace and Electronic Systems, 1981. [39] J. N. Tsitsiklis. Decentralized detection by a large number of sensors. Mathematics of Control, Signals and Systems, 1(2):167–182, 1988. [40] J. N. Tsitsiklis. Decentralized detection. In H. V. Poor and J. B. Thomas, editors, Advances in Statistical Signal Processing, volume 2. JAI Press Inc., 1990. [41] V. V. Veeravalli, T. Başar, and H. V. Poor. Minimax robust decentralized detection. IEEE Transactions on Information Theory, 40(1):35–40, January 1994. [42] I.-H. Wang, S.-L. Huang, and K.-Y. Lee. Extracting sparse data via histogram queries. Proceedings of Annual Allerton Conference on Communications, Control, and Computing, September 2016. [43] I.-H. Wang, S.-L. Huang, K.-Y. Lee, and K.-C. Chen. Data extraction via histogram and arithmetic mean queries: Fundamental limits and algorithms. Proceedings of IEEE International Symposium on Information Theory, July 2016. [44] O. Zeitouni and M. Gutman. On universal hypothesis testing via large deviations. IEEE Transactions on Information Theory, 37(2):285–290, March 1991. [45] O. Zeitouni, J. Ziv, and N. Merhav. When is the generalized likelihood ratio test optimal? IEEE Transactions on Information Theory, 38(5):1597–1602, 1992.
dc.identifier.uri	http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71730	-
dc.description.abstract	在群眾分包的問題中，平台將許多子問題（例如影像標記）分派給不同的工人，並搜集他們的答案。顧及工人的隱私，搜集的資料通常為匿名的，讓平台在進行最後的決策與估計變得十分困難。在此論文中，我們提出了兩種解決方法，首先是以測試問題推估工人的可靠性，而第二種則是匿名假說檢定。考慮一個大小為n的資料庫，其中每筆資料紀錄了一個工人的群組（依據其可靠度所分群），而總共的群組個數為有限的。測試問題讓我們能區分不同群組，也就是說，對於不同類別的工人，他們所標記的答案是不同的。我們可以選擇其中一部份的工人記錄其標記的統計結果（亦即該部分工人當中，每個群組所佔的人數），而我們希望以最少的測試問題T_n將每個工人確切的類別復原出來。然而，在實際情況中，該統計結果往往是不精確、有雜訊的。我們假定雜訊的大小為delta_n，而我們的目標是希望復原錯誤的人數少於k_n。在這個問題中，我們主要的貢獻有以下兩點：首先，當delta_n = O(sqrtk_n)時，需要大約T_n = Theta(n/log n)數量級的測試問題，可以將工人的類別回復。對於達到上界的驗算法，我們採取“隨機取樣”的想法，將每筆測試問題以二分之一的機率隨機送給每個工人。而下界的部分，則是使用“填裝”的想法構造一個不等式。再來對於delta_n = Omega( k_n(1+epsilon)/2)（其中任意epsilon > 0）的情形，我們證明了至少需要T_n* = omega(np)的測試問題，才足以達成我們的目標。對於這種雜訊強度較高的情形，我們發展了一套新的下界方式，並使用了一些組合最佳化的技巧。在第二部分，我們考慮匿名假說檢定，在其中不同群組的工人們有著不同的標記能力，因此其答案有著不同的機率分佈。困難之處在於，雖然我們知道每個群組的確切人數，但我們並不知道每筆標記資料是來自哪個群組，因此我們將此問題約化為“多重假說檢定”。在這部分，我們的第一個貢獻是提出了最佳的測試，可寫為在不同假說底下的混合分佈之概似比檢定。第二個貢獻則是刻劃出在樣本數n趨近於無限大時，在Neyman-Pearson setting底下第二類錯誤趨近於零的速度。該收斂速度的冪次方在機率分布空間之上定義了一個廣義的賦距，同時此結果可以被推廣到Chernoff regime。我們的結果量化了匿名性對群眾分包問題所造成的效應，並且此理論工具可應用於許多不同問題，諸如感測網路、物聯網與資訊系統等等。	zh_TW
dc.description.abstract	In this thesis, we propose two treatments to overcome the anonymity issue in privacy-preserving crowdsourcing: group recovery with golden questions and anonymous hypothesis testing. Consider a data set comprising n items, each of which represents a worker and his or her group index (based on the worker's labeling reliability). The golden questions enable us to distinguish disparate groups of crowds, that is, workers from different groups will respond to different answers. By sequentially allocating the golden tasks to a subset of crowds, the fusion center will obtain the histogram of the queried subset. The (unnormalized) histogram, however, is perturbed by some additive noise with magnitude less than delta_n. The goal is to reconstruct the data set such that the Hamming distance between the reconstructed and the actual one is smaller than a tolerance parameter k_n. We are interested in the fundamental limit on the minimum number of queries T_n* required to recover the n-th worker data set within k_n tolerance subject to delta_n noisy perturbation. We first show that if delta_n = O(sqrtk_n), the minimum query complexity T_n = Theta(n/log n), where the achievability is based on random sampling, and the converse is based on a packing argument. On the other hand, if delta_n = Omega( k_n(1+epsilon)/2) for some epsilon > 0, we prove that T_n = omega(np) for any positive integer p. In other words, no querying methods with poly(n) query complexity can successfully reconstruct the data set in that regime. This impossibility result is established by a novel combinatorial lower bound on T_n*. For the second remedy, anonymous hypothesis testing, each of item in the data set corresponds to a response from a single worker. The fusion center aims to detect a binary parameter by querying the database. The workers are clustered into multiple groups, and hence the responses in different groups follow different distributions under a given hypothesis. The key challenge is the anonymity of the responses -- although it knows the exact number of workers n and the distribution of observations in each group, it does not know which group each worker belongs to. It is hence natural to consider it as a composite hypothesis testing problem. First, we propose an optimal test called mixture likelihood ratio test, which is a randomized threshold test based on the ratio of the uniform mixture of all the possible distributions under one hypothesis to that under the other hypothesis. Second, we focus on the Neyman-Pearson setting and specify the error exponent of the worst-case type-II error probability as n tends to infinity, assuming the number of workers in each group is proportional to n. The exponent defines a new divergence between two vector distributions. The results are extended to Chernoff regime by solving a convex optimization problem. Our results elucidate the price of anonymity in anonymous detection. %The results are also applied to distributed detection under Byzantine attacks, which hints that the conventional approach based on simple hypothesis testing might be too pessimistic. The developed theories and tools are not restricted to crowdsourcing problem but can be widely applied to various tasks, such as wireless sensor networks, Internet of Thing, or information retrieval system, etc..	en
dc.description.provenance	Made available in DSpace on 2021-06-17T06:07:56Z (GMT). No. of bitstreams: 1 ntu-106-R05942078-1.pdf: 1405069 bytes, checksum: 05d76a65567a63a5bbc619429649e0a2 (MD5) Previous issue date: 2017	en
dc.description.tableofcontents	誌謝 iii 摘要 iv Abstract v Chap. 1 Introduction 1.1 Privacy-preservingCrowdsourcing -- 1 1.2 GroupRecoverywithGoldenQuestions -- 3 1.3 AnonymousHypothesisTesting -- 5 1.4 Beyond Neyman-Pearson Regime: Anonymous Hypothesis Testing for BayesianFormulation -- 8 1.5 OrganizationoftheThesis -- 9 Chap. 2 Background 11 2.1 Basic Information Theory -- 11 2.1.1 Kullback-Leibler Divergence -- 11 2.1.2 Method of Types -- 12 2.2 Concentration Inequalities and LargeDeviation -- 13 2.3 Hypothesis Testing -- 14 2.3.1 Simple Hypothesis Testing -- 15 2.3.2 Asymptotic Regime -- 16 2.3.3 Composite Hypothesis Testing -- 18 Part I Group Recovery with Golden Tasks -- 21 Chap. 3 Data Extraction with Presence of Noise 23 3.1 PreviousWork -- 23 3.2 Problem Formulation -- 24 3.2.1 Data Set, Queries,and Responses -- 25 3.2.2 CriteriaofDataExtraction -- 25 3.3 Main Results -- 27 3.3.1 Achievability -- 27 3.3.2 Lower Boundson Query Complexity -- 27 3.3.3 Fundamental Limit -- 28 3.4 Achievability via Randomized Querying -- 29 3.5 Proof of the Combinatorial Lower Bound -- 32 3.6 Extension -- 35 Part II Anonymous Hypothesis Testing Chap.4 Anonymous Hypothesis Testing: Optimal Decision Rules and Type-II Error Exponents 4.1 Previous Work -- 39 4.2 Problem Formulation -- 41 4.2.1 Problem Setup -- 41 4.2.2 Notations -- 43 4.3 Main Results -- 44 4.3.1 Main Contributions -- 45 4.3.2 Numerical Evaluations -- 47 4.3.3 Distributed Detection with Byzantine Attacks -- 48 4.4 Proof of Theorem 4.3.1 -- 49 4.5 Proof of Theorem4.3.2 -- 56 4.6 Extension -- 62 Chap. 5 Anonymous Hypothesis Testing: Beyond Neyman-Pearson Regime 5.1 Problem Formulation -- 65 5.2 Main Result -- 66 5.2.1 Efficient Test -- 66 5.2.2 Achievable Region: An Information-Geometric Perspective -- 67 5.3 Proof of Theorem 5.2.1 -- 68 5.4 Proof of Theorem 5.2.2 -- 72 Chap.6 Conclusions and Future Work -- 79 6.1 Future Works -- 80 Appendix A Proof of Lemmas in Chap. 3 A.1 Proof of Theorem 3.3.2 -- 83 A.2 Proof of Theorem 3.3.4 -- 85 A.3 Proof of Claim 3.4.1 -- 87 A.4 Technical Lemmas -- 88 A.4.1 Lemma A.4.1 -- 88 A.4.2 Proof of Lemma A.3.1 -- 91 A.4.3 Proof of Lemma 3.4.1 -- 92 A.4.4 Proof of Lemma A.2.1 -- 93 A.4.5 Proof of Lemma A.2.2 -- 94 Appendix B Proof of Lemmas in Chap. 5 B.1 Proof of Proposition 4.3.1 -- 97 B.2 Proof of Lemma 4.5.1 -- 99 B.3 Proof of Lemma 4.5.2 -- 100 Bibliography 105
dc.language.iso	en
dc.title	匿名統計推論的理論極限:異質性感測網路與群眾分包之隱私考量	zh_TW
dc.title	Fundamental Limits of Anonymous Statistical Inference: Privacy-Preserving Crowdsourcing and Heterogeneous Sensor Networks	en
dc.type	Thesis
dc.date.schoolyear	107-1
dc.description.degree	碩士
dc.contributor.oralexamcommittee	陳柏寧(Po-Ning Chen),洪樂文
dc.subject.keyword	匿名,隱私,群眾分包,感測網路,資料解碼,假說檢定,	zh_TW
dc.subject.keyword	Anonymity,Crowdsourcing,DataDecoding,HypothesisTesting,	en
dc.relation.page	109
dc.identifier.doi	10.6342/NTU201804404
dc.rights.note	有償授權
dc.date.accepted	2018-12-27
dc.contributor.author-college	電機資訊學院	zh_TW
dc.contributor.author-dept	電信工程學研究所	zh_TW
顯示於系所單位：	電信工程學研究所

文件中的檔案：

檔案	大小	格式
ntu-106-1.pdf 目前未授權公開取用	1.37 MB	Adobe PDF

顯示文件簡單紀錄

系統中的文件，除了特別指名其著作權條款之外，均受到著作權保護，並且保留所有的權利。