
Dependable-Intelligent-Systems-Lab/Dataset-Characteristics

License: MIT | Standard: Python Style Guide

D-ACE: Dataset Assessment and Characteristics Evaluation

Dataset quality assessment is a crucial part of machine learning and artificial intelligence: the performance and accuracy of an algorithm depend directly on the quality and characteristics of the data it is trained on. Poor-quality datasets can produce biased or inaccurate results and, in turn, incorrect decisions. It is therefore important to measure dataset quality and identify potential issues before training machine learning models.
D-ACE is a framework for assessing the quality and characteristics of datasets. It evaluates factors such as missing values, class imbalance, and data heterogeneity in order to flag issues that may degrade the performance of machine learning algorithms. Addressing these issues before training can make the resulting models more accurate and reliable, which in turn improves the dependability of the systems built on them.

Currently Supporting Characteristics:

| Characteristic | Abbreviation |
|---|---|
| Dimensionality | d |
| NrOfInstances | N |
| NrOfClasses | C |
| ZeroSparsity | OS |
| NaNSparsity | NS |
| DataSparsity | DS |
| DataSparsityRatio | DSR |
| Correlation of Features with Class | CorrFC |
| Correlation of Features without Class | CorrFNC |
| Multivariate Normality | MVN |
| Homogeneity of class covariance | HCCov |
| Intrinsic Dimensionality-PCA | ID |
| Intrinsic Dimensionality Ratio | IDR |
| Feature Noise variance | FN1 |
| Feature Noise paper | FN2 |
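
To give a feel for the simplest characteristics in the list, here is a minimal sketch in plain Python. It is not D-ACE's API: the function name `basic_characteristics`, the dictionary keys, and the definitions used (zero sparsity as the fraction of entries equal to zero, NaN sparsity as the fraction of entries that are NaN) are assumptions made for illustration, and the framework may define these quantities differently.

```python
import math


def basic_characteristics(X, y):
    """Summarise a small tabular dataset (hypothetical helper, not part of D-ACE).

    X is a list of rows, each row a list of floats (NaN marks a missing value).
    y is a list of class labels, one per row.
    """
    n_instances = len(X)
    dimensionality = len(X[0]) if X else 0
    n_classes = len(set(y))

    # Flatten the feature matrix so entries can be counted in one pass.
    values = [v for row in X for v in row]
    total = len(values)

    # NaN never compares equal to 0, so the two counts do not overlap.
    n_zero = sum(1 for v in values if v == 0)
    n_nan = sum(1 for v in values if isinstance(v, float) and math.isnan(v))

    return {
        "d": dimensionality,
        "N": n_instances,
        "C": n_classes,
        # Assumed definitions: fraction of zero entries / fraction of NaN entries.
        "zero_fraction": n_zero / total if total else 0.0,
        "nan_fraction": n_nan / total if total else 0.0,
    }


X = [
    [0.0, 1.5, float("nan")],
    [2.0, 0.0, 3.0],
    [0.0, 4.0, 1.0],
    [1.0, 2.0, 0.0],
]
y = ["a", "b", "a", "b"]

summary = basic_characteristics(X, y)
print(summary)
```

In this toy dataset, 4 of the 12 entries are zero and 1 is NaN, so the two fractions come out as one third and one twelfth respectively.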

To-do:

  • Add dataset separability evaluation metrics
  • Add geometric characteristics
  • Add mislabeling ratio
  • Add algorithms such as Data Shapley: Ghorbani, A., & Zou, J. (2019, May). Data Shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning (pp. 2242-2251). PMLR.
  • Check dataset balance with respect to sensitive features for fairness evaluation
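
As an illustration of what the last item could involve, the sketch below computes the class distribution within each group of a sensitive feature. This is only one possible reading of "balance with respect to sensitive features"; the function name `class_balance_by_group` and the toy data are hypothetical, and nothing here reflects a planned D-ACE interface.

```python
from collections import Counter, defaultdict


def class_balance_by_group(labels, sensitive):
    """Return, for each sensitive group, the share of each class label.

    Hypothetical helper for illustration; not part of D-ACE.
    """
    counts = defaultdict(Counter)
    for label, group in zip(labels, sensitive):
        counts[group][label] += 1

    # Normalise the per-group counts into proportions.
    return {
        group: {label: n / sum(c.values()) for label, n in c.items()}
        for group, c in counts.items()
    }


labels = [1, 0, 1, 1, 0, 0, 0, 1]
sensitive = ["f", "f", "f", "f", "m", "m", "m", "m"]

balance = class_balance_by_group(labels, sensitive)
print(balance)
```

Large differences between the groups' class proportions, as in this toy example, would be a signal worth investigating before training.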

Collaborators

Dependable Intelligent Systems Lab., University of Hull

Fraunhofer Institute for Experimental Software Engineering

Contributors

  • Jerin Antony
  • Akinwande Adegbola
  • Zhibao Mian
  • Septavera Sharvia
  • Koorosh Aslansefat
  • Mohammad Naveed Akram
  • Iannis Sorokos
  • Yiannis Papadopoulos

License

This framework is available under the MIT License.
