xref: /aosp_15_r20/external/rappor/doc/data-flow.md (revision 2abb31345f6c95944768b5222a9a5ed3fc68cc00)
1*2abb3134SXin LiRAPPOR Data Flow
2*2abb3134SXin Li================
3*2abb3134SXin Li
4*2abb3134SXin LiThis doc explains the simulation tools and data formats in the [RAPPOR
5*2abb3134SXin Lirepository](https://github.com/google/rappor).  We'll focus on the code, and
6*2abb3134SXin Lidescribe the algorithm only informally.  For details, see the [paper][].
7*2abb3134SXin Li
8*2abb3134SXin LiOverview
9*2abb3134SXin Li--------
10*2abb3134SXin Li
11*2abb3134SXin LiStart with this command:
12*2abb3134SXin Li
13*2abb3134SXin Li    $ ./demo.sh run
14*2abb3134SXin Li
15*2abb3134SXin LiIt takes a minute or so to run.  The dependencies listed in the
16*2abb3134SXin Li[README][] must be installed.  At the end, it will say:
17*2abb3134SXin Li
18*2abb3134SXin Li    Wrote _tmp/report.html.  Open this in your browser.
19*2abb3134SXin Li
20*2abb3134SXin LiIt should look like [this][example].
21*2abb3134SXin Li
22*2abb3134SXin LiThe following diagram shows what processes and files are involved in the demo.
23*2abb3134SXin LiOvals represent **processes**; rectangles represent **data**.  The dotted lines
24*2abb3134SXin Lidenote components that are involved in the simulation, but wouldn't be used in
25*2abb3134SXin Lia "real" setting.
26*2abb3134SXin Li
27*2abb3134SXin LiIn most configurations, reporting (in blue) is done by client machines, while
28*2abb3134SXin Lianalysis (in green) is done by a server.
29*2abb3134SXin Li
30*2abb3134SXin Li<img src="data-flow.png" alt="Diagram of RAPPOR Data Flow" />
31*2abb3134SXin Li
32*2abb3134SXin LiIn the simulation, reporting consists of these steps:
33*2abb3134SXin Li
34*2abb3134SXin Li  1. Generate simulated input data with different distributions.
35*2abb3134SXin Li  2. Obscure each value with the RAPPOR privacy-preserving reporting mechanism.
36*2abb3134SXin Li
37*2abb3134SXin LiAnalysis consists of these steps:
38*2abb3134SXin Li
39*2abb3134SXin Li  1. Aggregate the reports by summing bits (i.e. make a counting Bloom filter)
40*2abb3134SXin Li  2. Come up with candidate strings, and hash them in the same manner as the
41*2abb3134SXin Li  client.
42*2abb3134SXin Li  3. Using the reports, RAPPOR parameters, and candidate strings as input,
43*2abb3134SXin Li  infer the distribution of true values.  We don't see the values themselves.
44*2abb3134SXin Li  We plot the true and inferred distributions side by side for comparison.
45*2abb3134SXin Li
46*2abb3134SXin LiThis process is described in detail below.
47*2abb3134SXin Li
48*2abb3134SXin Li1. Generating Simulated Input
49*2abb3134SXin Li-----------------------------
50*2abb3134SXin Li
51*2abb3134SXin LiThe `tests/gen_sim_input.py` tool generates CSV data, like this:
52*2abb3134SXin Li
53*2abb3134SXin Li<!-- TODO: a realistic data set would be nice? How could we generate one?  -->
54*2abb3134SXin Li
55*2abb3134SXin Li**exp.csv**
56*2abb3134SXin Li
57*2abb3134SXin Li    client, true_value
58*2abb3134SXin Li    1,      v6
59*2abb3134SXin Li    1,      v3
60*2abb3134SXin Li    1,      v3
61*2abb3134SXin Li    1,      v5
62*2abb3134SXin Li    1,      v13
63*2abb3134SXin Li    1,      v1
64*2abb3134SXin Li    1,      v8
65*2abb3134SXin Li    2,      v2
66*2abb3134SXin Li    2,      v3
67*2abb3134SXin Li    2,      v1
68*2abb3134SXin Li    2,      v8
69*2abb3134SXin Li    2,      v1
70*2abb3134SXin Li    2,      v30
71*2abb3134SXin Li    2,      v10
72*2abb3134SXin Li    3,      v4
73*2abb3134SXin Li    ...
74*2abb3134SXin Li
75*2abb3134SXin Li*(spaces added for clarity)*
76*2abb3134SXin Li
77*2abb3134SXin LiBy default we generate 700,000 rows: 7 random values from `v1` to `v50` for
78*2abb3134SXin Lieach client.  These can be thought of as a variable being reported over time.
79*2abb3134SXin Li
80*2abb3134SXin LiWe're simulating an environment where there are many RAPPOR clients, and a
81*2abb3134SXin Lisingle server does the RAPPOR analysis on the accumulated data.
82*2abb3134SXin Li
83*2abb3134SXin LiThe `client` is represented by an integer ID.  The `true_value` should **not**
84*2abb3134SXin Libe sent over the network because we wish to preserve the client's privacy.
85*2abb3134SXin Li
86*2abb3134SXin Li
87*2abb3134SXin Li2. RAPPOR Reporting
88*2abb3134SXin Li-------------------
89*2abb3134SXin Li
90*2abb3134SXin LiThe `tests/rappor_sim.py` tool uses the Python client library
91*2abb3134SXin Li(`client/python/rappor.py`) to obscure the `v1` .. `vN` strings.  We want to
92*2abb3134SXin Liinfer the distribution of these strings over the entire population, but we
93*2abb3134SXin Lidon't want to know any individual values.
94*2abb3134SXin Li
95*2abb3134SXin LiAfter the RAPPOR transformation, we get another CSV file with 700,000 rows.
96*2abb3134SXin LiEach client is assigned a cohort.
97*2abb3134SXin Li
98*2abb3134SXin Li**exp_out.csv**
99*2abb3134SXin Li
100*2abb3134SXin Li    client, cohort, rappor
101*2abb3134SXin Li    1,      63,     1111101011110111
102*2abb3134SXin Li    1,      15,     1110110011111100
103*2abb3134SXin Li    1,      12,     0110101111100101
104*2abb3134SXin Li    1,       0,     1111100111110111
105*2abb3134SXin Li    1,       3,     1001110111110011
106*2abb3134SXin Li    1,      14,     1011111010110011
107*2abb3134SXin Li    1,      33,     0111010100101011
108*2abb3134SXin Li    2,      40,     0011011010101001
109*2abb3134SXin Li    2,      35,     1010110101110100
110*2abb3134SXin Li    2,      58,     1110110110111110
111*2abb3134SXin Li    2,      38,     0010001111001010
112*2abb3134SXin Li    2,       5,     1110111011100101
113*2abb3134SXin Li    2,      36,     0111010100111111
114*2abb3134SXin Li    2,      39,     0101101000101101
115*2abb3134SXin Li    3,      32,     0011100111111110
116*2abb3134SXin Li    ...
117*2abb3134SXin Li
118*2abb3134SXin Li*(spaces added for clarity)*
119*2abb3134SXin Li
120*2abb3134SXin LiWe also get a one-row CSV file that contains the RAPPOR parameters:
121*2abb3134SXin Li
122*2abb3134SXin Li**exp_params.csv**
123*2abb3134SXin Li
124*2abb3134SXin Li    k,h,m,p,q,f
125*2abb3134SXin Li    16,2,64,0.5,0.75,0.5
126*2abb3134SXin Li
127*2abb3134SXin LiThese are described in the [paper][]. The parameters that the clients use
128*2abb3134SXin Limust be known to the server, or the decoding will fail.  In addition, all
129*2abb3134SXin Liclients must use the same parameters for a given variable.
130*2abb3134SXin Li
131*2abb3134SXin LiThe `rappor_sim.py` process also writes these files:
132*2abb3134SXin Li
133*2abb3134SXin Li- `exp_hist.csv`: The true histogram of input values.  This is used only for
134*2abb3134SXin Li  comparison.  In the real world you obviously won't have this.
135*2abb3134SXin Li- `exp_true_inputs.txt`: A list of the unique values reported, e.g. `v1` ..
136*2abb3134SXin Li  `v50`.  You won't have this either, in general.  To use RAPPOR, you must
137*2abb3134SXin Li  supply *candidate strings*, described below.
138*2abb3134SXin Li
139*2abb3134SXin Li3. Report Aggregation
140*2abb3134SXin Li---------------------
141*2abb3134SXin Li
142*2abb3134SXin Li`sum_bits.py` takes the `exp_out.csv` output, and produces the "counts" file:
143*2abb3134SXin Li
144*2abb3134SXin Li**exp_counts.csv**
145*2abb3134SXin Li
146*2abb3134SXin Li    11116,6273,6433,6347,6385,6290,6621,6359,6747,6623,6321,6696,6282,6652,6368,6286,6222
147*2abb3134SXin Li    10861,6365,6263,6170,6258,6107,6633,6171,6226,6123,6286,6254,6408,6182,6442,6195,6187
148*2abb3134SXin Li    ...
149*2abb3134SXin Li
150*2abb3134SXin LiThe file has 64 rows, because the simulation has 64 cohorts by default (`m =
151*2abb3134SXin Li64`).  This parameter should be adjusted based on the number of unique true
152*2abb3134SXin Livalues expected.  <!-- TODO: more detail -->
153*2abb3134SXin Li
154*2abb3134SXin LiThere are 17 columns.  The left-most column is the total number of reports in
155*2abb3134SXin Lithe cohort.  The remaining 16 columns correspond to the `k = 16` bits in the
156*2abb3134SXin LiBloom filter.  Each column contains the number of reports with that bit set
157*2abb3134SXin Lito 1.
158*2abb3134SXin Li
159*2abb3134SXin LiSo, in general, the "counts" file is a `(k+1) * m` matrix.
160*2abb3134SXin Li
161*2abb3134SXin Li4. Candidate Strings
162*2abb3134SXin Li--------------------
163*2abb3134SXin Li
164*2abb3134SXin LiIn the simulation, we assume that the analyst will come up with a *superset* of
165*2abb3134SXin Lithe candidate strings.  This is done in the `more-candidates` /
166*2abb3134SXin Li`print-candidates` functions in `demo.sh`.
167*2abb3134SXin Li
168*2abb3134SXin LiYou can also test what happens if you omit true strings from the candidates, by
169*2abb3134SXin Liediting the invocation of `print-candidates` in `run-dist`:
170*2abb3134SXin Li
171*2abb3134SXin Li    # Example of omitting true values.  Generate candidates from
172*2abb3134SXin Li    # exp_true_inputs.txt, omitting values v1 and v2.
173*2abb3134SXin Li
174*2abb3134SXin Li    print-candidates $dist 'v1|v2'  > _tmp/${dist}_candidates.txt
175*2abb3134SXin Li
176*2abb3134SXin LiIn general, coming up with candidates is an application- or metric-specific
177*2abb3134SXin Liprocess.
178*2abb3134SXin Li
179*2abb3134SXin LiThe candidates are hashed by `hash_candidates.py` to create the "map" file,
180*2abb3134SXin Libefore being passed to R for analysis.
181*2abb3134SXin Li
182*2abb3134SXin Li**exp_map.csv**
183*2abb3134SXin Li
184*2abb3134SXin Li    v1,8,13,30,22,37,37,53,53,77,67,89,86,97,97,118,128,139,136,157,<truncated>
185*2abb3134SXin Li    v10,13,2,25,28,42,45,58,60,68,66,91,89,108,102,113,125,130,131,<truncated>
186*2abb3134SXin Li
187*2abb3134SXin LiThe map file has one row per candidate.  In this case, there are 60 rows:
188*2abb3134SXin Li50 for the true values and 10 for "fake" values, which make the candidates a
189*2abb3134SXin Lisuperset of the true input.
190*2abb3134SXin Li
191*2abb3134SXin LiThe left most column is the raw candidate string.  Then there are 128 more
192*2abb3134SXin Licolumns: for `m = 64` cohorts times `k = 2` hash functions in the Bloom filter.
193*2abb3134SXin Li
194*2abb3134SXin Li<!-- TODO: more detail about setting params?  Examples of coming up with
195*2abb3134SXin Licandidate strings? -->
196*2abb3134SXin Li
197*2abb3134SXin Li5. RAPPOR Analysis
198*2abb3134SXin Li------------------
199*2abb3134SXin Li
200*2abb3134SXin LiOnce you have the `counts`, `params`, and `map` files, you can pass it to the
201*2abb3134SXin Li`tests/analyze.R` tool, which is a small wrapper around the `analyze/R`
202*2abb3134SXin Lilibrary.
203*2abb3134SXin Li
204*2abb3134SXin LiYou will get a plot of the true distribution vs. the distribution recovered
205*2abb3134SXin Liwith the RAPPOR privacy algorithm.
206*2abb3134SXin Li
207*2abb3134SXin Li[View the example output][example].
208*2abb3134SXin Li
209*2abb3134SXin LiYou can change the simulation parameters and RAPPOR parameters via flags, and
210*2abb3134SXin Licompare the resulting distributions.
211*2abb3134SXin Li
212*2abb3134SXin LiFor example, if you expect more unique values from clients, you should also use
213*2abb3134SXin Limore cohorts (i.e. raise `m`), to prevent hash function collisions from
214*2abb3134SXin Lidegrading the result quality.
215*2abb3134SXin Li
216*2abb3134SXin Li<!-- TODO:
217*2abb3134SXin Li     - how to change flags
218*2abb3134SXin Li     - more detail on what the various parameters do
219*2abb3134SXin Li     - association analysis
220*2abb3134SXin Li     - basic RAPPOR
221*2abb3134SXin Li     - longitudinal privacy
222*2abb3134SXin Li-->
223*2abb3134SXin Li
224*2abb3134SXin LiConclusion
225*2abb3134SXin Li----------
226*2abb3134SXin Li
227*2abb3134SXin LiRAPPOR allows you infer statistics about populations while preserving the
228*2abb3134SXin Liprivacy of individual clients.  In this doc, we walked through a simple demo.
229*2abb3134SXin LiHowever, there are other variations of RAPPOR and settings in which you can use
230*2abb3134SXin LiRAPPOR, which we'll write more about.
231*2abb3134SXin Li
232*2abb3134SXin LiFeel free to send feedback on this doc to
233*2abb3134SXin Li[[email protected]](https://groups.google.com/forum/#!forum/rappor-discuss).
234*2abb3134SXin Li
235*2abb3134SXin Li
236*2abb3134SXin Li[README]: https://github.com/google/rappor/blob/master/README.md
237*2abb3134SXin Li[paper]: http://arxiv.org/abs/1407.6981
238*2abb3134SXin Li[example]: http://google.github.io/rappor/examples/report.html
239*2abb3134SXin Li
240