README.md
1# Private Join and Compute
2
3This project contains an implementation of the "Private Join and Compute"
4functionality. This functionality allows two users, each holding an input file,
5to privately compute the sum of associated values for records that have common
6identifiers.
7
8In more detail, suppose a Server has a file containing the following
9identifiers:
10
11Identifiers |
12----------- |
13Sam |
14Ada |
15Ruby |
16Brendan |
17
18And a Client has a file containing the following identifiers, paired with
19associated integer values:
20
21Identifiers | Associated Values
22----------- | -----------------
23Ruby | 10
24Ada | 30
25Alexander | 5
26Mika | 35
27
28Then the Private Join and Compute functionality would allow the Client to learn
29that the input files had *2* identifiers in common, and that the associated
30values summed to *40*. It does this *without* revealing which specific
31identifiers were in common (Ada and Ruby in the example above), or revealing
32anything additional about the other identifiers in the two parties' data set.
33
34Private Join and Compute is a variant of the well-studied Private Set
35Intersection functionality. We sometimes also refer to Private Join and Compute
36as Private Intersection-Sum.
37
38## How to run the protocol
39
40In order to run Private Join and Compute, you need to install Bazel, if you
41don't have it already.
42[Follow the instructions for your platform on the Bazel website.](https://docs.bazel.build/versions/master/install.html)
43
44You also need to install Git, if you don't have it already.
45[Follow the instructions for your platform on the Git website.](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
46
47Once you've installed Bazel and Git, open a Terminal and clone the Private Join
48and Compute repository into a local folder:
49
50```shell
51git clone https://github.com/google/private-join-and-compute.git
52```
53
54Navigate into the `private-join-and-compute` folder you just created, and build
55the Private Join and Compute library and dependencies using Bazel:
56
57```bash
58cd private-join-and-compute
59bazel build //private_join_and_compute:all
60```
61
62(All the following instructions must be run from inside the
63private-join-and-compute folder.)
64
65Next, generate some dummy data to run the protocol on:
66
67```shell
68bazel-bin/private_join_and_compute/generate_dummy_data --server_data_file=/tmp/dummy_server_data.csv \
69--client_data_file=/tmp/dummy_client_data.csv
70```
71
72This will create dummy data for the server and client at the specified
73locations. You can look at the files in `/tmp/dummy_server_data.csv` and
74`/tmp/dummy_client_data.csv` to see the dummy data that was generated. You can
75also change the size of the dummy data generated using additional flags. For
76example:
77
78```shell
79bazel-bin/private_join_and_compute/generate_dummy_data \
80--server_data_file=/tmp/dummy_server_data.csv \
81--client_data_file=/tmp/dummy_client_data.csv --server_data_size=1000 \
82--client_data_size=1000 --intersection_size=200 --max_associated_value=100
83```
84
85Once you've generated dummy data, you can start the server as follows:
86
87```shell
88bazel-bin/private_join_and_compute/server --server_data_file=/tmp/dummy_server_data.csv
89```
90
91The server will load data from the specified file, and wait for a connection
92from the client.
93
94Once the server is running, you can start a client to connect to the server.
95Create a new terminal and navigate to the private-join-and-compute folder. Once
96there, run the following command to start the client:
97
98```shell
99bazel-bin/private_join_and_compute/client --client_data_file=/tmp/dummy_client_data.csv
100```
101
102The client will connect to the server and execute the steps of the protocol
103sequentially. At the end of the protocol, the client will output the
104Intersection Size (the number of identifiers in common) and the Intersection Sum
105(the sum of associated values). If the protocol was successful, both the server
106and client will shut down.
107
108## Caveats
109
110Several caveats should be carefully considered before using Private Join and
111Compute.
112
113### Security Model
114
115Our protocol has security against honest-but-curious adversaries. This means
116that as long as both participants follow the protocol honestly, neither will
117learn more than the size of the intersection and the intersection-sum. However,
118if a participant deviates from the protocol, it is possible they could learn
119more than the prescribed information. For example, they could learn the specific
120identifiers in the intersection. If the underlying data is sensitive, we
121recommend performing a careful risk analysis before using Private Join and
122Compute, to ensure that neither party has an incentive to deviate from the
123protocol. The protocol can also be supplemented with external enforcement such
124as code audits to ensure that no party deviates from the protocol.
125
126### Maliciously Chosen Inputs
127
128We note that our protocol does not authenticate that parties use "real" input,
129nor does it prevent them from arbitrarily changing their input. We suggest
130careful analysis of whether any party has an incentive to lie about their
131inputs. This risk can also be mitigated by external enforcement such as code
132audits.
133
134### Leakage from the Intersection-Sum.
135
136While the Private Join and Compute functionality is supposed to reveal only the
137intersection-size and intersection-sum, it is possible that the intersection-sum
138itself could reveal something about which identifiers were in common.
139
140For example, if an identifier has a very unique associated integer values, then
141it may be easy to detect if that identifier was in the intersection simply by
142looking at the intersection-sum. One way this could happen is if one of the
143identifiers has a very large associated value compared to all other identifiers.
144In that case, if the intersection-sum is large, one could reasonably infer that
145that identifier was in the intersection. To mitigate this, we suggest scrubbing
146inputs to remove identifiers with "outlier" values.
147
148Another way that the intersection-sum may leak which identifiers are in the
149intersection is if the intersection is too small. This could make it easier to
150guess which combination of identifiers could be in the intersection in order to
151yield a particular intersection-sum. To mitigate this, one could abort the
152protocol if the intersection-size is below a certain threshold, or to add noise
153to the output of the protocol.
154
155(Note that these mitigations are not currently implemented in this open-source
156library.)
157
158## Disclaimers
159
160This is not an officially supported Google product. The software is provided
161as-is without any guarantees or warranties, express or implied.
162