1*2abb3134SXin LiRAPPOR Data Flow 2*2abb3134SXin Li================ 3*2abb3134SXin Li 4*2abb3134SXin LiThis doc explains the simulation tools and data formats in the [RAPPOR 5*2abb3134SXin Lirepository](https://github.com/google/rappor). We'll focus on the code, and 6*2abb3134SXin Lidescribe the algorithm only informally. For details, see the [paper][]. 7*2abb3134SXin Li 8*2abb3134SXin LiOverview 9*2abb3134SXin Li-------- 10*2abb3134SXin Li 11*2abb3134SXin LiStart with this command: 12*2abb3134SXin Li 13*2abb3134SXin Li $ ./demo.sh run 14*2abb3134SXin Li 15*2abb3134SXin LiIt takes a minute or so to run. The dependencies listed in the 16*2abb3134SXin Li[README][] must be installed. At the end, it will say: 17*2abb3134SXin Li 18*2abb3134SXin Li Wrote _tmp/report.html. Open this in your browser. 19*2abb3134SXin Li 20*2abb3134SXin LiIt should look like [this][example]. 21*2abb3134SXin Li 22*2abb3134SXin LiThe following diagram shows what processes and files are involved in the demo. 23*2abb3134SXin LiOvals represent **processes**; rectangles represent **data**. The dotted lines 24*2abb3134SXin Lidenote components that are involved in the simulation, but wouldn't be used in 25*2abb3134SXin Lia "real" setting. 26*2abb3134SXin Li 27*2abb3134SXin LiIn most configurations, reporting (in blue) is done by client machines, while 28*2abb3134SXin Lianalysis (in green) is done by a server. 29*2abb3134SXin Li 30*2abb3134SXin Li<img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" /> 31*2abb3134SXin Li 32*2abb3134SXin LiIn the simulation, reporting consists of these steps: 33*2abb3134SXin Li 34*2abb3134SXin Li 1. Generate simulated input data with different distributions. 35*2abb3134SXin Li 2. Obscure each value with the RAPPOR privacy-preserving reporting mechanism. 36*2abb3134SXin Li 37*2abb3134SXin LiAnalysis consists of these steps: 38*2abb3134SXin Li 39*2abb3134SXin Li 1. Aggregate the reports by summing bits (i.e. make a counting Bloom filter) 40*2abb3134SXin Li 2. Come up with candidate strings, and hash them in the same manner as the 41*2abb3134SXin Li client. 42*2abb3134SXin Li 3. Using the reports, RAPPOR parameters, and candidate strings as input, 43*2abb3134SXin Li infer the distribution of true values. We don't see the values themselves. 44*2abb3134SXin Li We plot the true and inferred distributions side by side for comparison. 45*2abb3134SXin Li 46*2abb3134SXin LiThis process is described in detail below. 47*2abb3134SXin Li 48*2abb3134SXin Li1. Generating Simulated Input 49*2abb3134SXin Li----------------------------- 50*2abb3134SXin Li 51*2abb3134SXin LiThe `tests/gen_sim_input.py` tool generates CSV data, like this: 52*2abb3134SXin Li 53*2abb3134SXin Li<!-- TODO: a realistic data set would be nice? How could we generate one? --> 54*2abb3134SXin Li 55*2abb3134SXin Li**exp.csv** 56*2abb3134SXin Li 57*2abb3134SXin Li client, true_value 58*2abb3134SXin Li 1, v6 59*2abb3134SXin Li 1, v3 60*2abb3134SXin Li 1, v3 61*2abb3134SXin Li 1, v5 62*2abb3134SXin Li 1, v13 63*2abb3134SXin Li 1, v1 64*2abb3134SXin Li 1, v8 65*2abb3134SXin Li 2, v2 66*2abb3134SXin Li 2, v3 67*2abb3134SXin Li 2, v1 68*2abb3134SXin Li 2, v8 69*2abb3134SXin Li 2, v1 70*2abb3134SXin Li 2, v30 71*2abb3134SXin Li 2, v10 72*2abb3134SXin Li 3, v4 73*2abb3134SXin Li ... 74*2abb3134SXin Li 75*2abb3134SXin Li*(spaces added for clarity)* 76*2abb3134SXin Li 77*2abb3134SXin LiBy default we generate 700,000 rows: 7 random values from `v1` to `v50` for 78*2abb3134SXin Lieach client. These can be thought of as a variable being reported over time. 79*2abb3134SXin Li 80*2abb3134SXin LiWe're simulating an environment where there are many RAPPOR clients, and a 81*2abb3134SXin Lisingle server does the RAPPOR analysis on the accumulated data. 82*2abb3134SXin Li 83*2abb3134SXin LiThe `client` is represented by an integer ID. The `true_value` should **not** 84*2abb3134SXin Libe sent over the network because we wish to preserve the client's privacy. 85*2abb3134SXin Li 86*2abb3134SXin Li 87*2abb3134SXin Li2. RAPPOR Reporting 88*2abb3134SXin Li------------------- 89*2abb3134SXin Li 90*2abb3134SXin LiThe `tests/rappor_sim.py` tool uses the Python client library 91*2abb3134SXin Li(`client/python/rappor.py`) to obscure the `v1` .. `vN` strings. We want to 92*2abb3134SXin Liinfer the distribution of these strings over the entire population, but we 93*2abb3134SXin Lidon't want to know any individual values. 94*2abb3134SXin Li 95*2abb3134SXin LiAfter the RAPPOR transformation, we get another CSV file with 700,000 rows. 96*2abb3134SXin LiEach client is assigned a cohort. 97*2abb3134SXin Li 98*2abb3134SXin Li**exp_out.csv** 99*2abb3134SXin Li 100*2abb3134SXin Li client, cohort, rappor 101*2abb3134SXin Li 1, 63, 1111101011110111 102*2abb3134SXin Li 1, 15, 1110110011111100 103*2abb3134SXin Li 1, 12, 0110101111100101 104*2abb3134SXin Li 1, 0, 1111100111110111 105*2abb3134SXin Li 1, 3, 1001110111110011 106*2abb3134SXin Li 1, 14, 1011111010110011 107*2abb3134SXin Li 1, 33, 0111010100101011 108*2abb3134SXin Li 2, 40, 0011011010101001 109*2abb3134SXin Li 2, 35, 1010110101110100 110*2abb3134SXin Li 2, 58, 1110110110111110 111*2abb3134SXin Li 2, 38, 0010001111001010 112*2abb3134SXin Li 2, 5, 1110111011100101 113*2abb3134SXin Li 2, 36, 0111010100111111 114*2abb3134SXin Li 2, 39, 0101101000101101 115*2abb3134SXin Li 3, 32, 0011100111111110 116*2abb3134SXin Li ... 117*2abb3134SXin Li 118*2abb3134SXin Li*(spaces added for clarity)* 119*2abb3134SXin Li 120*2abb3134SXin LiWe also get a one-row CSV file that contains the RAPPOR parameters: 121*2abb3134SXin Li 122*2abb3134SXin Li**exp_params.csv** 123*2abb3134SXin Li 124*2abb3134SXin Li k,h,m,p,q,f 125*2abb3134SXin Li 16,2,64,0.5,0.75,0.5 126*2abb3134SXin Li 127*2abb3134SXin LiThese are described in the [paper][]. The parameters that the clients use 128*2abb3134SXin Limust be known to the server, or the decoding will fail. In addition, all 129*2abb3134SXin Liclients must use the same parameters for a given variable. 130*2abb3134SXin Li 131*2abb3134SXin LiThe `rappor_sim.py` process also writes these files: 132*2abb3134SXin Li 133*2abb3134SXin Li- `exp_hist.csv`: The true histogram of input values. This is used only for 134*2abb3134SXin Li comparison. In the real world you obviously won't have this. 135*2abb3134SXin Li- `exp_true_inputs.txt`: A list of the unique values reported, e.g. `v1` .. 136*2abb3134SXin Li `v50`. You won't have this either, in general. To use RAPPOR, you must 137*2abb3134SXin Li supply *candidate strings*, described below. 138*2abb3134SXin Li 139*2abb3134SXin Li3. Report Aggregation 140*2abb3134SXin Li--------------------- 141*2abb3134SXin Li 142*2abb3134SXin Li`sum_bits.py` takes the `exp_out.csv` output, and produces the "counts" file: 143*2abb3134SXin Li 144*2abb3134SXin Li**exp_counts.csv** 145*2abb3134SXin Li 146*2abb3134SXin Li 11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222 147*2abb3134SXin Li 10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187 148*2abb3134SXin Li ... 149*2abb3134SXin Li 150*2abb3134SXin LiThe file has 64 rows, because the simulation has 64 cohorts by default (`m = 151*2abb3134SXin Li64`). This parameter should be adjusted based on the number of unique true 152*2abb3134SXin Livalues expected. <!-- TODO: more detail --> 153*2abb3134SXin Li 154*2abb3134SXin LiThere are 17 columns. The left-most column is the total number of reports in 155*2abb3134SXin Lithe cohort. The remaining 16 columns correspond to the `k = 16` bits in the 156*2abb3134SXin LiBloom filter. Each column contains the number of reports with that bit set 157*2abb3134SXin Lito 1. 158*2abb3134SXin Li 159*2abb3134SXin LiSo, in general, the "counts" file is a `(k+1) * m` matrix. 160*2abb3134SXin Li 161*2abb3134SXin Li4. Candidate Strings 162*2abb3134SXin Li-------------------- 163*2abb3134SXin Li 164*2abb3134SXin LiIn the simulation, we assume that the analyst will come up with a *superset* of 165*2abb3134SXin Lithe candidate strings. This is done in the `more-candidates` / 166*2abb3134SXin Li`print-candidates` functions in `demo.sh`. 167*2abb3134SXin Li 168*2abb3134SXin LiYou can also test what happens if you omit true strings from the candidates, by 169*2abb3134SXin Liediting the invocation of `print-candidates` in `run-dist`: 170*2abb3134SXin Li 171*2abb3134SXin Li # Example of omitting true values. Generate candidates from 172*2abb3134SXin Li # exp_true_inputs.txt, omitting values v1 and v2. 173*2abb3134SXin Li 174*2abb3134SXin Li print-candidates $dist 'v1|v2' > _tmp/${dist}_candidates.txt 175*2abb3134SXin Li 176*2abb3134SXin LiIn general, coming up with candidates is an application- or metric-specific 177*2abb3134SXin Liprocess. 178*2abb3134SXin Li 179*2abb3134SXin LiThe candidates are hashed by `hash_candidates.py` to create the "map" file, 180*2abb3134SXin Libefore being passed to R for analysis. 181*2abb3134SXin Li 182*2abb3134SXin Li**exp_map.csv** 183*2abb3134SXin Li 184*2abb3134SXin Li v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,<truncated> 185*2abb3134SXin Li v10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,<truncated> 186*2abb3134SXin Li 187*2abb3134SXin LiThe map file has one row per candidate. In this case, there are 60 rows: 188*2abb3134SXin Li50 for the true values and 10 for "fake" values, which make the candidates a 189*2abb3134SXin Lisuperset of the true input. 190*2abb3134SXin Li 191*2abb3134SXin LiThe left most column is the raw candidate string. Then there are 128 more 192*2abb3134SXin Licolumns: for `m = 64` cohorts times `k = 2` hash functions in the Bloom filter. 193*2abb3134SXin Li 194*2abb3134SXin Li<!-- TODO: more detail about setting params? Examples of coming up with 195*2abb3134SXin Licandidate strings? --> 196*2abb3134SXin Li 197*2abb3134SXin Li5. RAPPOR Analysis 198*2abb3134SXin Li------------------ 199*2abb3134SXin Li 200*2abb3134SXin LiOnce you have the `counts`, `params`, and `map` files, you can pass it to the 201*2abb3134SXin Li`tests/analyze.R` tool, which is a small wrapper around the `analyze/R` 202*2abb3134SXin Lilibrary. 203*2abb3134SXin Li 204*2abb3134SXin LiYou will get a plot of the true distribution vs. the distribution recovered 205*2abb3134SXin Liwith the RAPPOR privacy algorithm. 206*2abb3134SXin Li 207*2abb3134SXin Li[View the example output][example]. 208*2abb3134SXin Li 209*2abb3134SXin LiYou can change the simulation parameters and RAPPOR parameters via flags, and 210*2abb3134SXin Licompare the resulting distributions. 211*2abb3134SXin Li 212*2abb3134SXin LiFor example, if you expect more unique values from clients, you should also use 213*2abb3134SXin Limore cohorts (i.e. raise `m`), to prevent hash function collisions from 214*2abb3134SXin Lidegrading the result quality. 215*2abb3134SXin Li 216*2abb3134SXin Li<!-- TODO: 217*2abb3134SXin Li - how to change flags 218*2abb3134SXin Li - more detail on what the various parameters do 219*2abb3134SXin Li - association analysis 220*2abb3134SXin Li - basic RAPPOR 221*2abb3134SXin Li - longitudinal privacy 222*2abb3134SXin Li--> 223*2abb3134SXin Li 224*2abb3134SXin LiConclusion 225*2abb3134SXin Li---------- 226*2abb3134SXin Li 227*2abb3134SXin LiRAPPOR allows you infer statistics about populations while preserving the 228*2abb3134SXin Liprivacy of individual clients. In this doc, we walked through a simple demo. 229*2abb3134SXin LiHowever, there are other variations of RAPPOR and settings in which you can use 230*2abb3134SXin LiRAPPOR, which we'll write more about. 231*2abb3134SXin Li 232*2abb3134SXin LiFeel free to send feedback on this doc to 233*2abb3134SXin Li[[email protected]](https://groups.google.com/forum/#!forum/rappor-discuss). 234*2abb3134SXin Li 235*2abb3134SXin Li 236*2abb3134SXin Li[README]: https://github.com/google/rappor/blob/master/README.md 237*2abb3134SXin Li[paper]: http://arxiv.org/abs/1407.6981 238*2abb3134SXin Li[example]: http://google.github.io/rappor/examples/report.html 239*2abb3134SXin Li 240