Please use this Handle URI to cite this item:
http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71730
Title: | Fundamental Limits of Anonymous Statistical Inference: Privacy-Preserving Crowdsourcing and Heterogeneous Sensor Networks |
Author: | Wei-Ning Chen (陳偉寧) |
Advisor: | I-Hsiang Wang (王奕翔) |
Keywords: | Anonymity, Privacy, Crowdsourcing, Sensor Networks, Data Decoding, Hypothesis Testing |
Publication Year: | 2017 |
Degree: | Master |
Abstract: | In this thesis, we propose two treatments to overcome the anonymity issue in privacy-preserving crowdsourcing: group recovery with golden questions and anonymous hypothesis testing. Consider a data set comprising n items, each of which represents a worker and his or her group index (based on the worker's labeling reliability). The golden questions enable us to distinguish different groups of workers; that is, workers from different groups will respond with different answers. By sequentially allocating the golden tasks to a subset of workers, the fusion center obtains the histogram of the queried subset (that is, the number of workers from each group within the subset). The (unnormalized) histogram, however, is perturbed by additive noise of magnitude at most delta_n. The goal is to reconstruct the data set so that the Hamming distance between the reconstruction and the actual data set is smaller than a tolerance parameter k_n. We are interested in the fundamental limit on the minimum number of queries T_n* required to recover the n-worker data set within tolerance k_n subject to the delta_n noisy perturbation.
We first show that if delta_n = O(sqrt(k_n)), the minimum query complexity is T_n* = Theta(n / log n), where the achievability is based on random sampling (each query is assigned to each worker independently with probability 1/2), and the converse is based on a packing argument. On the other hand, if delta_n = Omega(k_n^((1+epsilon)/2)) for some epsilon > 0, we prove that T_n* = omega(n^p) for any positive integer p. In other words, no querying method with poly(n) query complexity can successfully reconstruct the data set in that regime. This impossibility result is established by a novel combinatorial lower bound on T_n*. For the second remedy, anonymous hypothesis testing, each item in the data set corresponds to a response from a single worker. The fusion center aims to detect a binary parameter by querying the database. The workers are clustered into multiple groups, and hence the responses in different groups follow different distributions under a given hypothesis. The key challenge is the anonymity of the responses: although the fusion center knows the exact number of workers n and the distribution of observations in each group, it does not know which group each worker belongs to. It is hence natural to treat the problem as composite hypothesis testing. First, we propose an optimal test, the mixture likelihood ratio test, which is a randomized threshold test based on the ratio of the uniform mixture of all possible distributions under one hypothesis to that under the other. Second, we focus on the Neyman-Pearson setting and characterize the error exponent of the worst-case type-II error probability as n tends to infinity, assuming the number of workers in each group is proportional to n. The exponent defines a new divergence between two vector distributions. The results are extended to the Chernoff regime by solving a convex optimization problem. Our results elucidate the price of anonymity in anonymous detection.
The results are also applied to distributed detection under Byzantine attacks, which hints that the conventional approach based on simple hypothesis testing might be too pessimistic. The developed theories and tools are not restricted to the crowdsourcing problem but can be widely applied to various tasks, such as wireless sensor networks, the Internet of Things, and information retrieval systems. |
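As an illustration of the query model in the abstract, the following sketch simulates the random-sampling scheme: each golden-question query includes every worker independently with probability 1/2, and the fusion center observes the group histogram of the sampled subset under bounded additive noise. This is a toy simulation of the model, not code from the thesis; all names and the integer noise model are illustrative assumptions.

```python
import random

def noisy_histogram_queries(groups, num_queries, delta, num_group_types, rng=None):
    """Simulate noisy histogram queries over randomly sampled subsets.

    groups: list of length n; groups[i] is worker i's group index.
    Each query includes every worker independently with probability 1/2 and
    returns the (unnormalized) group histogram of the queried subset,
    perturbed entrywise by additive noise of magnitude at most delta.
    """
    rng = rng or random.Random(0)
    results = []
    for _ in range(num_queries):
        # sample each worker independently with probability 1/2
        subset = [i for i in range(len(groups)) if rng.random() < 0.5]
        hist = [0] * num_group_types
        for i in subset:
            hist[groups[i]] += 1
        # bounded perturbation: each entry shifted by at most delta
        noisy = [h + rng.randint(-delta, delta) for h in hist]
        results.append((subset, noisy))
    return results
```

The reconstruction question studied in the thesis is then how many such query results are needed before every worker's group can be recovered up to k_n errors.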
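The mixture likelihood ratio test described in the abstract can be sketched, for toy-sized n, by averaging the product likelihood over every anonymous assignment of workers to groups and thresholding the ratio of the two mixtures. This is a brute-force illustration under assumed names and interfaces, not the thesis's construction, and the randomization at the threshold is omitted.

```python
from itertools import permutations

def mixture_likelihood(xs, group_sizes, densities):
    """Uniform mixture, over anonymous group assignments, of the product
    likelihood of the responses xs (exact brute force; toy sizes only).

    xs: observed responses, one per worker.
    group_sizes: group_sizes[g] workers belong to group g.
    densities: densities[g](x) = probability of response x from group g.
    """
    labels = [g for g, m in enumerate(group_sizes) for _ in range(m)]
    assignments = set(permutations(labels))  # distinct worker-to-group labelings
    total = 0.0
    for assign in assignments:
        prod = 1.0
        for x, g in zip(xs, assign):
            prod *= densities[g](x)
        total += prod
    return total / len(assignments)

def mixture_lrt(xs, group_sizes, dens_h0, dens_h1, threshold=1.0):
    """Declare H1 when the ratio of the H1 mixture to the H0 mixture
    exceeds the threshold (deterministic simplification)."""
    ratio = (mixture_likelihood(xs, group_sizes, dens_h1)
             / mixture_likelihood(xs, group_sizes, dens_h0))
    return ratio > threshold
```

For example, with two groups of Bernoulli responders whose success probabilities are low under H0 and high under H1, an all-ones observation drives the ratio above 1 and the test declares H1.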
URI: | http://tdr.lib.ntu.edu.tw/jspui/handle/123456789/71730 |
DOI: | 10.6342/NTU201804404 |
Full-Text Authorization: | Authorized with fee |
Appears in Collections: | Graduate Institute of Communication Engineering |
Files in This Item:
File | Size | Format | |
---|---|---|---|
ntu-106-1.pdf (currently not authorized for public access) | 1.37 MB | Adobe PDF |
Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated.