Data-Centric-L-Diversity-Model

This repository contains the code and functions that implement the Data-Centric L-Diversity (DCLD) model, a synthetic data-aided anonymization model.

Main Modules of the proposed DCLD
There are two key modules in the proposed DCLD model.
1- Data Pre-processing
In this module, the data to be anonymized is pre-processed to make it anonymization-ready. The key difference between the pre-processed data and the underlying data is that the former is free of common vulnerabilities. Although some existing methods employ pre-processing, they improve only a few aspects of the data. This work extends them by combining synthetic data with real data to address the distribution imbalance problem. Once the class imbalance is addressed, the constraints on the sensitive attribute (SA) values and their distributions can be met effectively. Most existing methods do not fix this problem, which leads to two crucial design problems: they either expose privacy or leave many records unprocessed.
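The balancing idea can be illustrated with a minimal sketch. It assumes that the imbalance ratio is the majority SA count divided by the minority SA count, and that balancing means raising every rare SA value to the majority count. The function name `records_needed_for_balance` and the toy SA values are hypothetical and are not taken from the repository.

```python
from collections import Counter


def records_needed_for_balance(sa_values):
    """Return (imbalance_ratio, {sa_value: synthetic records needed}).

    Assumption: balance means every SA value reaches the majority count.
    """
    counts = Counter(sa_values)
    majority = max(counts.values())
    minority = min(counts.values())
    imbalance_ratio = majority / minority
    # Only SA values below the majority count need synthetic records.
    needed = {value: majority - n for value, n in counts.items() if n < majority}
    return imbalance_ratio, needed


# Toy SA column (hypothetical): 8 records of one class, 2 of the other.
sa_column = ["<=50K"] * 8 + [">50K"] * 2
ratio, needed = records_needed_for_balance(sa_column)
print(ratio, needed)
```

The size of the synthetic set (Dnew) would then be the sum of the values in `needed`.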
2- Shallow Anonymization
In this module, only the minimal necessary anonymization is applied in order to produce high-quality data. Specifically, the pattern-friendly attributes are first identified using a customized implementation of random forest and are then minimally generalized to yield privacy-preserved data. It is worth noting that minimal generalization does not put privacy at risk, because there is higher uncertainty in the SA column.
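As a rough illustration of scoring attributes with a random forest, the sketch below ranks quasi-identifiers (QIDs) by the impurity-based feature importances of a stock scikit-learn `RandomForestClassifier`. This is not the authors' customized implementation; the function `rank_qids`, the attribute names, and the toy data are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_qids(X, y, qid_names, seed=0):
    """Rank QIDs by random forest feature importance (highest first)."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    scores = dict(zip(qid_names, model.feature_importances_))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): the SA depends on "age" only.
rng = np.random.default_rng(0)
n = 300
age = rng.integers(18, 80, size=n)
hours = rng.integers(10, 60, size=n)
noise = rng.integers(0, 5, size=n)
X = np.column_stack([age, hours, noise])
y = (age > 45).astype(int)

ranking = rank_qids(X, y, ["age", "hours", "noise"])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting the attribute that drives the SA receives the highest score, which is the kind of signal that can guide which QIDs to generalize least.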
Next, we provide implementation details that can help in re-implementing the proposed model.

Datasets used in Experimentation

Four real-world publicly available datasets have been used to evaluate the effectiveness of the proposed DCLD.

1- Adult dataset

This is a reasonably sized dataset containing diverse demographic information about US individuals. The database and privacy communities have widely used this dataset for experimentation purposes. Its original form is available at http://archive.ics.uci.edu/dataset/2/adult.

2- Stroke Prediction dataset

This dataset has been widely used in machine learning, particularly for imbalanced learning problems. Owing to its high imbalance, it has also been widely used in many AI competitions. This dataset in its original form is available at https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset.

3- Census Income dataset

This is a large dataset containing diverse information about individuals. The database and privacy communities have also widely used this dataset for experimentation purposes. This dataset in its original form is available at http://archive.ics.uci.edu/dataset/117/census+income+kdd.

4- Diabetes 130-US Hospitals dataset

This is also a large dataset, containing diverse medical information about individuals collected from clinical care at 130 US hospitals and integrated delivery networks. The database and privacy communities have also used this dataset for experimentation purposes. This dataset in its original form is available at https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008.

Implementation process

At the outset, all required libraries must be installed. Depending on the programming language, the necessary libraries should be added to the development environment. In Python, a library is installed with pip install followed by the library name, e.g., pip install numpy, pip install pandas, pip install scikit-learn, and pip install scipy.

The table below describes each code file to help readers understand the implementation and reproduce the results.

File Name	Description of implementation	Output files/information
Arrange_Attributes.py	Remove direct identifiers and arrange the remaining attributes	Data with QIDs and SA only (the last column is the SA)
Missing_Values_Imputation (When duplicates Removal is needed).py	Clean the data of basic vulnerabilities (missing values, outliers, duplicates, etc.)	Data with basic vulnerabilities fixed
Missing_Values_Imputation (When duplicates Removal is not needed).py	Clean the data of basic vulnerabilities (missing values, outliers, etc.)	Data with basic vulnerabilities fixed
Imbalance_Ratio_Computing_Records Analysis.py	Analyze the imbalance w.r.t. the SA and find the number of records needed for balance	Imbalance ratio information and the size of Dnew required for data balancing
Interface_Program_SD_Generation.py	Generate synthetic data¹ to balance the distribution of rare SA values	Synthetic data with a structure identical to the real data
Data_Balancing_by_Adding_Dnew.py	Generate balanced data by mixing Dnew with the real data (only the rare SA class is augmented)	Balanced and clean dataset (most vulnerabilities are fixed)
Feature_Scores (Best value Combinations is Desirable).py	Identify pattern-friendly QIDs in the data	Scores of the QIDs w.r.t. pattern information
KLDCriteria_Aware_Grouping.py	Cluster the data according to the k and l values	Clustered data where the size of each cluster is at least k and every cluster is 2-diverse
QIDs-Values_Replacements.py	Generalize the data using lower-level generalization	Generalized data where the functional relationship between real and anonymized data is high
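The grouping criterion in the table (each cluster has at least k records and enough distinct SA values) can be checked with a short sketch. It assumes the simplest "distinct" form of l-diversity, in which a cluster must hold at least l different SA values, and that the SA is the last field of each record. The function `satisfies_k_and_l` and the sample records are hypothetical and do not come from KLDCriteria_Aware_Grouping.py.

```python
def satisfies_k_and_l(clusters, k, l):
    """Check that every cluster has >= k records and >= l distinct SA values.

    Each record is a tuple whose last element is the SA value.
    """
    for cluster in clusters:
        if len(cluster) < k:
            return False
        distinct_sa = {record[-1] for record in cluster}
        if len(distinct_sa) < l:
            return False
    return True


# Toy clusters (hypothetical): (age, gender, SA).
diverse_cluster = [(25, "M", "flu"), (27, "M", "cancer"), (29, "F", "flu")]
uniform_cluster = [(40, "F", "flu"), (41, "F", "flu"), (45, "M", "flu")]

print(satisfies_k_and_l([diverse_cluster], k=3, l=2))
print(satisfies_k_and_l([diverse_cluster, uniform_cluster], k=3, l=2))
```

The second call fails the check because one cluster contains a single SA value, which is exactly the situation that balancing the rare SA class is meant to avoid.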

The file generalization_mappings.json provides sample generalization hierarchy information for the QIDs, which can be used for generalization when loaded from the main program. Further information on constructing the hierarchies can be found in a recent study². The file Requirements.txt lists the Python libraries that are required to execute the code.
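A minimal sketch of how such a mapping file could drive generalization is shown below. The JSON layout used here (attribute name, then original value, then generalized value) is an assumption for illustration only; the actual structure of generalization_mappings.json may differ. The function `generalize_record` and the sample values are likewise hypothetical.

```python
import json

# Hypothetical mapping layout: {attribute: {original value: generalized value}}.
mappings_json = """
{
  "age": {"25": "20-30", "27": "20-30", "41": "40-50"},
  "education": {"Bachelors": "Higher education", "Masters": "Higher education"}
}
"""


def generalize_record(record, mappings):
    """Replace each QID value with its generalized value when one is defined.

    Attributes or values without a mapping (e.g. the SA) are left unchanged.
    """
    return {
        attr: mappings.get(attr, {}).get(str(value), value)
        for attr, value in record.items()
    }


mappings = json.loads(mappings_json)
record = {"age": 27, "education": "Masters", "income": ">50K"}
generalized = generalize_record(record, mappings)
print(generalized)
```

Keeping the hierarchy in a data file rather than in code means the generalization level can be adjusted without touching the main program.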

Citing DCLD

If you use the DCLD implementation, please cite the following work:

[1] A. Majeed and S. O. Hwang, "A Data-Centric ℓ-Diversity Model for Securely Publishing Personal Data With Enhanced Utility," in IEEE Transactions on Big Data, doi: 10.1109/TBDATA.2024.3524832.

¹ The open-source CTGAN implementation was used with slight modifications (https://github.com/sdv-dev/CTGAN).
² Details about the generalization hierarchies are available at https://www.sciencedirect.com/science/article/pii/S2667305323000923.
