Public Repository: Machine Learning & Data Mining project using the South African Heart Disease dataset. Applied PCA, Regularized Linear Regression, ANN, Logistic Regression, and Decision Trees with cross-validation for regression and classification. Includes feature scaling, EDA, and statistical tests.

Davide011/ML_project_South_African_Heart_Disease


Machine learning and data mining

Python scikit-learn Pandas NumPy Matplotlib License


Project

🧠 South African Heart Disease Data Analysis

This repository presents an in-depth analysis of the South African Heart Disease dataset, exploring how various risk factors contribute to coronary heart disease (CHD). The project combines machine learning, statistical analysis, and data visualization to investigate data structure, correlations, and predictive potential.

📊 Project Overview

Dataset: Subset of the CORIS study (Rousseauw et al., 1983), including 462 individuals and 10 health-related attributes.

Objective: Identify significant CHD predictors and evaluate whether linear dimensionality reduction (PCA) can capture discriminative patterns between healthy and CHD-affected individuals.

🔬 Methods & Workflow

Data Preprocessing – Standardization, outlier removal, and normalization to address variable scale imbalance.

Exploratory Data Analysis (EDA) – Statistical summaries, boxplots, and correlation heatmaps to visualize relationships.

Dimensionality Reduction (PCA) – Computed via Singular Value Decomposition (SVD), keeping enough principal components to retain 90% of the data variance.

Model Evaluation – Comparison between raw and standardized PCA projections; assessment of variance explained and component significance.
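The preprocessing and PCA steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the function name `pca_svd`, the synthetic feature matrix, and the per-feature scales are all assumptions made for the example, chosen only to mimic features with very different variances.

```python
import numpy as np


def pca_svd(X, standardize=True, variance_threshold=0.90):
    """PCA via SVD; returns projected data, explained-variance ratios, and k.

    Hypothetical helper for illustration, not taken from the repository.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                # center each feature
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)    # scale each feature to unit variance

    # Rows of Vt are the principal directions; S holds the singular values.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    explained = S**2 / np.sum(S**2)        # fraction of variance per component
    cumulative = np.cumsum(explained)

    # Smallest number of components whose cumulative variance reaches the threshold.
    k = int(np.searchsorted(cumulative, variance_threshold)) + 1
    k = min(k, len(S))

    Z = Xc @ Vt[:k].T                      # project onto the first k components
    return Z, explained, k


# Synthetic stand-in data (assumption): 462 rows, 9 features on very different scales.
rng = np.random.default_rng(0)
scales = np.array([1.0, 5.0, 2.0, 8.0, 1.0, 10.0, 4.0, 25.0, 15.0])
X = rng.normal(size=(462, 9)) * scales

Z_raw, explained_raw, k_raw = pca_svd(X, standardize=False)
Z_std, explained_std, k_std = pca_svd(X, standardize=True)

print("components for 90% variance (raw):", k_raw)
print("components for 90% variance (standardized):", k_std)
```

On unscaled data the first components mostly track the highest-variance features, which is why comparing the raw and standardized projections is informative.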

📈 Findings

Standardization was critical due to large variance differences across features.

PCA alone was insufficient for accurate CHD classification, as clusters between classes overlapped heavily.

These results suggest using Logistic Regression or nonlinear ML models (e.g., Random Forests, SVM) to improve classification performance.
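As a rough sketch of the suggested next step, a cross-validated Logistic Regression baseline could look like the following in scikit-learn. The helper name `cv_accuracy`, the fold count, and the synthetic data from `make_classification` are assumptions for illustration; the scores it prints say nothing about performance on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_accuracy(X, y, n_splits=5, seed=0):
    """Cross-validated accuracy of a standardized logistic regression.

    Hypothetical helper for illustration, not taken from the repository.
    """
    # Scaling lives inside the pipeline so it is refit on each training fold,
    # which avoids leaking test-fold statistics into the model.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy")


# Synthetic stand-in data (assumption): same shape as 462 individuals, 9 predictors.
X, y = make_classification(
    n_samples=462, n_features=9, n_informative=4, random_state=0
)

scores = cv_accuracy(X, y)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Stratified folds keep the class ratio similar in every split, which matters when CHD cases are the minority class.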

βš™οΈ Skills Demonstrated

Data preprocessing and feature scaling

Exploratory data visualization

Principal Component Analysis (PCA)

Model evaluation and variance analysis

Scientific communication and reporting

🧩 Technologies

Python, NumPy, Pandas, Matplotlib, scikit-learn

Group 5

Authors:

  • Aleksander Nagaj
  • Davide Ventuo
  • Filippo Bosi

Dataset

South African Heart Disease
