
This repository contains all public data, Python scripts, and documentation relating to the NIST Public Safety Communications Research Division's Differential Privacy program, including past prize challenges and benchmark problem sets.


xiyueyiwan/Differential-Privacy-Synthetic-Data-Challenge-assets

 
 


NIST PSCR 2018 Differential Privacy Temporal Map Challenge assets and links

Welcome!

This data repository contains assets for the 2018 Differential Privacy Synthetic Data Challenge hosted by the Public Safety Communications Research (PSCR) Division of the National Institute of Standards and Technology (NIST).

Please navigate to this link for information relating to the 2020 Temporal Map Challenge.

This repository is maintained by Gary Howarth, Prize Challenge Manager, NIST PSCR.

The main contents of the current repository are the public data and scoring functions for the Challenge (the Competitor's Packs) for each of the three sprints.

2018 Differential Privacy Synthetic Data Challenge Algorithms

De-identification Keywords: Differential Privacy, Synthetic Data Generation

Brief Description: Participants in Match #3 of NIST's 2018 Public Safety Communications Research Differential Privacy Synthetic Data Challenge developed these open source algorithms as part of an effort to advance differential privacy. Participants were challenged to create new methods of data de-identification, or improve existing ones, while preserving the dataset’s utility for analysis. All solutions were required to satisfy differential privacy, a provable guarantee of individual privacy protection. Participants used a data set of emergency response events occurring in San Francisco and a sub-sample of the IPUMS USA data for the 1940 U.S. Census.

Contributions are listed in alphabetical order.

DP_WGAN-UCLANESL

Team Members: Prof. Mani Srivastava (@msrivastava) - Team Captain (Match 1 and Match 3), Moustafa Alzantot (@malzantot) - Match 1 and Match 3, Nat Snyder (@natsnyder1) - Match 1, Supriyo Chakraborty (@supriyogit) - Match 1

This repo contains an implementation of the award-winning solution to the 2018 Differential Privacy Synthetic Data Challenge by team UCLANESL. Our solution was awarded 5th place in Match #3 of the challenge, and an earlier version won 4th place in Match #1. The solution trains a Wasserstein generative adversarial network (W-GAN) on the real private dataset. Training is made differentially private by sanitizing the gradients of the discriminator (clipping their norm and adding Gaussian noise). Once the model is trained, it can be used to generate a synthetic dataset by feeding random noise into the generator.
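The gradient sanitization step described above can be illustrated with a minimal sketch. This is not the team's code: the function name `sanitize_gradients` and the parameters `clip_norm` and `noise_multiplier` are hypothetical, and the sketch assumes per-example gradients are available as plain lists of floats. It only shows the clip-then-add-noise pattern, not the GAN training loop or the privacy accounting.

```python
import math
import random

def sanitize_gradients(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to L2 norm clip_norm, sum, add Gaussian noise, average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only the gradients whose norm exceeds the clipping bound.
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * factor
    # Noise standard deviation is tied to the clipping bound (assumed convention).
    sigma = noise_multiplier * clip_norm
    batch = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / batch for t in total]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy discriminator gradients
noisy_grad = sanitize_gradients(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(noisy_grad)
```

Clipping bounds how much any single record can change the summed gradient, which is what lets the Gaussian noise scale be tied to `clip_norm`.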

More Information | Link to Tool

DPFieldGroups

Team Members & Affiliation: John Gardner (no affiliation)
Brief Description: This is the fourth-place entry in the third round of the NIST Differential Privacy Synthetic Data Challenge. The goal of this challenge is to produce differentially private synthetic data while retaining as much useful information as possible about the original data set. Colorado census data from 1940 with 98 field columns were provided for algorithm development, with census data from other states used for testing. This solution groups together fields that have been found to be highly correlated. For each group, a histogram counts the number of occurrences of every possible combination of values of all fields in the group. For privatization, Laplace noise is added to every bin, with scale proportional to (number of groups) / (total epsilon). Synthetic data is generated by selecting a random bin for each group, with probability weighted by these noisy bin counts. The field values corresponding to each group's selected bin are written out together as a single row of synthetic data.
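The histogram-and-sample procedure above can be sketched as follows. This is an illustrative sketch, not the author's implementation: the names `noisy_group_histogram` and `synthesize`, the `domains` mapping, and the toy fields are hypothetical. It assumes the proportionality constant for the Laplace scale is 1 and that negative noisy counts are clamped to zero before sampling; the original tool may handle both differently.

```python
import itertools
import random
from collections import Counter

def laplace(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_group_histogram(rows, group, domains, scale, rng):
    """Count every possible value combination for the fields in group, then add noise."""
    counts = Counter(tuple(row[f] for f in group) for row in rows)
    bins = list(itertools.product(*(domains[f] for f in group)))
    # Assumption: negative noisy counts are clamped to zero to act as weights.
    weights = [max(0.0, counts[b] + laplace(rng, scale)) for b in bins]
    return bins, weights

def synthesize(rows, groups, domains, epsilon, n_out, rng):
    scale = len(groups) / epsilon  # assumed constant of proportionality: 1
    hists = [noisy_group_histogram(rows, g, domains, scale, rng) for g in groups]
    out = []
    for _ in range(n_out):
        record = {}
        for group, (bins, weights) in zip(groups, hists):
            if sum(weights) > 0:
                chosen = rng.choices(bins, weights=weights, k=1)[0]
            else:
                chosen = rng.choice(bins)  # fall back to uniform if all weights are zero
            record.update(zip(group, chosen))
        out.append(record)
    return out

domains = {"sex": ["M", "F"], "age": ["young", "old"], "state": ["CO", "NM"]}
groups = [("sex", "age"), ("state",)]
rows = [{"sex": "M", "age": "young", "state": "CO"}] * 30 + \
       [{"sex": "F", "age": "old", "state": "CO"}] * 20
synthetic = synthesize(rows, groups, domains, epsilon=1.0, n_out=5, rng=random.Random(0))
print(synthetic)
```

Because each group is sampled independently, correlations are preserved within a group but not across groups, which is why the grouping of correlated fields matters.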


Link to Tool and More Information

DPSyn

Team Members & Affiliations: Ninghui Li (Purdue University), Zhikun Zhang (Zhejiang University), Tianhao Wang (Purdue University)

Brief Description: We present DPSyn, an algorithm for synthesizing microdata while satisfying differential privacy, and its instantiation on the dataset used in the competition, namely the Public Use Microdata Sample (PUMS) of the 1940 USA Census Data.

Link to Tool and More Information

rmckenna

Team Member & Affiliation: Ryan McKenna (UMass Amherst)

Brief Description: The first-place entry in the third round of the NIST Differential Privacy Synthetic Data Challenge. The high-level idea is to (1) use the Gaussian mechanism to obtain noisy answers to a carefully selected set of counting queries (1-, 2-, and 3-way marginals) and (2) find a synthetic data set that approximates the true data with respect to those queries. The latter step is accomplished with [3], and the former step uses ideas inspired by [1] and [2]. More specifically, the queries are selected by calculating the mutual information (on the public dataset) for each pair of attributes and choosing the marginal queries that have high mutual information.

[1] Zhang, Jun, et al. "PrivBayes: Private data release via Bayesian networks." ACM Transactions on Database Systems (TODS) 42.4 (2017): 25.

[2] Chen, Rui, et al. "Differentially private high-dimensional data publication via sampling-based inference." Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.

[3] McKenna, Ryan, Daniel Sheldon, and Gerome Miklau. "Graphical-model based estimation and inference for differential privacy." Proceedings of the 36th International Conference on Machine Learning. 2019.
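The query selection and measurement in step (1) of the rmckenna entry can be sketched as below. This is a simplified illustration, not the author's code: the function names, the `domains` mapping, and the toy data are hypothetical, only pairwise (2-way) marginals are shown, and `sigma` is passed in directly rather than calibrated to a privacy budget. Step (2), the graphical-model estimation of [3], is not shown.

```python
import math
import random
from collections import Counter
from itertools import combinations, product

def mutual_information(rows, a, b):
    """Empirical mutual information (in nats) between attributes a and b."""
    n = len(rows)
    pa = Counter(r[a] for r in rows)
    pb = Counter(r[b] for r in rows)
    pab = Counter((r[a], r[b]) for r in rows)
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())

def select_pairs(public_rows, attributes, k):
    """Pick the k attribute pairs with the highest mutual information on public data."""
    pairs = sorted(combinations(attributes, 2),
                   key=lambda p: mutual_information(public_rows, *p), reverse=True)
    return pairs[:k]

def noisy_marginal(private_rows, attrs, domains, sigma, rng):
    """Gaussian mechanism: add N(0, sigma^2) noise to every cell of the marginal."""
    counts = Counter(tuple(r[a] for a in attrs) for r in private_rows)
    cells = product(*(domains[a] for a in attrs))
    return {cell: counts[cell] + rng.gauss(0.0, sigma) for cell in cells}

# Toy public data: a and b are perfectly correlated, c is independent of both.
public = [{"a": 0, "b": 0, "c": 0}, {"a": 0, "b": 0, "c": 1},
          {"a": 1, "b": 1, "c": 0}, {"a": 1, "b": 1, "c": 1}]
domains = {"a": [0, 1], "b": [0, 1], "c": [0, 1]}
chosen = select_pairs(public, ["a", "b", "c"], k=1)
measurements = {pair: noisy_marginal(public, pair, domains, sigma=2.0, rng=random.Random(0))
                for pair in chosen}
print(chosen, measurements)
```

Selecting queries on the public dataset means the selection itself consumes no privacy budget; only the noisy measurements on the private data do.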

Link to Tool and More Information
