Jupyter notebooks for training machine learning models that predict biological sex from skeletal measurements. These notebooks produce the models deployed in the SexEst web application.
Skeletal sex estimation is essential in osteoarchaeology and forensic anthropology. This project evaluates multiple machine learning classifiers on worldwide cranial and postcranial measurements and deploys the best-performing models (XGBoost, LightGBM, Linear Discriminant Analysis) in a free, open-source web application.
| Notebook | Dataset | Measurements | Sample Size |
|---|---|---|---|
| `howell_dataset_analysis.ipynb` | Howells craniometric | 32 cranial dimensions (GOL, NOL, BNL, BBH, XCB, etc.) | >2,500 individuals |
| `goldman_dataset_analysis.ipynb` | Goldman osteometric | 11 postcranial dimensions (BIB, HML, HHD, FML, TML, etc.) | ~1,500 individuals |
- Load data: import CSV files from the `datasets/` directory
- Preprocess: handle missing values (KNN imputation, iterative imputation), average left/right measurements, encode sex labels
- Feature selection: select measurement subsets for modelling
- Model training: train and compare 14 classifiers:
  - Logistic Regression, Decision Tree, SVM, Gaussian Process
  - Random Forest, AdaBoost, Gradient Boosting, Extra Trees
  - Gaussian Naive Bayes, k-Nearest Neighbors
  - Linear/Quadratic Discriminant Analysis
  - XGBoost, LightGBM
- Hyperparameter tuning: grid search, randomized search, Bayesian optimization (scikit-optimize)
- Export models: save trained models as `.dat` files (pickle) for use in the SexEst app
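The workflow above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebooks' actual code: the column names (`LHML`, `RHML`, `LFML`, `RFML`), the toy values, the choice of two classifiers, and the small grid over `C` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a measurement table (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
sex = rng.choice(["F", "M"], size=n)
shift = np.where(sex == "M", 15.0, 0.0)  # toy dimorphism: males larger on average
df = pd.DataFrame({
    "Sex": sex,
    "LHML": 300 + shift + rng.normal(0, 10, n),
    "RHML": 301 + shift + rng.normal(0, 10, n),
    "LFML": 420 + shift + rng.normal(0, 10, n),
    "RFML": 421 + shift + rng.normal(0, 10, n),
})

# Blank out about 5% of the measurements to mimic missing values.
sides = ["LHML", "RHML", "LFML", "RFML"]
holes = rng.random((n, len(sides))) < 0.05
df[sides] = df[sides].mask(holes)

# Preprocess: average left/right (mean skips NaN, so one side is enough)
# and encode the sex labels as integers.
df["HML"] = df[["LHML", "RHML"]].mean(axis=1)
df["FML"] = df[["LFML", "RFML"]].mean(axis=1)
X = df[["HML", "FML"]]
y = df["Sex"].map({"F": 0, "M": 1}).to_numpy()


def make_model(classifier):
    """KNN imputation and scaling live inside the pipeline, so they are
    refit on each training fold during cross-validation."""
    return make_pipeline(KNNImputer(n_neighbors=5), StandardScaler(), classifier)


# Model training: compare classifiers by 5-fold cross-validated accuracy.
candidates = {
    "LDA": LinearDiscriminantAnalysis(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
scores = {
    name: cross_val_score(make_model(clf), X, y, cv=5).mean()
    for name, clf in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")

# Hyperparameter tuning: a small grid search over the regularisation strength.
search = GridSearchCV(
    make_model(LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
```

Keeping the imputer inside the pipeline is a design choice for this sketch: it avoids leaking information from held-out folds into the imputed values. The notebooks may organise these steps differently.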
```
pandas
numpy
scipy
matplotlib
seaborn
scikit-learn
xgboost
lightgbm
scikit-optimize
joblib
```
```bash
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install pandas numpy scipy matplotlib seaborn scikit-learn xgboost lightgbm scikit-optimize joblib

# Or with conda
conda create -n sexest python=3.10
conda activate sexest
conda install pandas numpy scipy matplotlib seaborn scikit-learn xgboost lightgbm
pip install scikit-optimize
```

The notebooks require CSV datasets that are NOT included in this repository. See DATA_AVAILABILITY.md for download instructions.
| File | Description | Download |
|---|---|---|
| `datasets/Howell.csv` | Howells craniometric training data | Auerbach DATA page |
| `datasets/HowellTest.csv` | Howells craniometric test partition | Same as above |
| `datasets/Goldman.csv` | Goldman osteometric data | Same as above |
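Before opening the notebooks, it can help to confirm that the three files listed above are in place. The helper below is a hypothetical convenience sketch, not part of the repository; the function name `missing_datasets` is an assumption, and the file names are taken from the table.

```python
from pathlib import Path

# File names as listed in the dataset table.
EXPECTED_FILES = ["Howell.csv", "HowellTest.csv", "Goldman.csv"]


def missing_datasets(data_dir="datasets"):
    """Return the expected CSV files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_datasets()
if missing:
    print("Missing dataset files:", ", ".join(missing))
else:
    print("All dataset files found.")
```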
- Download the datasets and place them in the `datasets/` directory
- Open the notebooks in Jupyter or VS Code
- Run cells sequentially: the notebooks are designed to be executed top-to-bottom
- Trained models will be saved as `.dat` files in the working directory
```bash
# Start Jupyter
jupyter notebook
# Or in VS Code, open .ipynb files directly
```

The notebooks export trained models as pickle files:
- `lda_model_*.dat`: Linear Discriminant Analysis
- `xgb_model_*.dat`: XGBoost
- `lgb_model_*.dat`: LightGBM
These models are used by the SexEst Streamlit app.
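As a rough sketch of what exporting and reloading such a file looks like, the example below pickles a small Linear Discriminant Analysis model and restores it for prediction. It assumes plain `pickle` serialisation, as described above; the file name `lda_model_example.dat` and the toy training data are hypothetical, and the real notebooks or app may load models differently (for example with `joblib`).

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class data standing in for real measurements (hypothetical values).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(2.0, 1.0, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
model = LinearDiscriminantAnalysis().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "lda_model_example.dat"  # hypothetical file name

    # Export: write the fitted model to a .dat file with pickle.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)

    # Import: restore the model, as a consuming application would.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# Predict class probabilities for one new, made-up case.
new_case = np.array([[1.8, 2.1, 1.9]])
probabilities = restored.predict_proba(new_case)[0]
print(dict(zip(restored.classes_.tolist(), probabilities.round(3).tolist())))
```

Pickle files can execute arbitrary code when loaded, so only load `.dat` files from a source you trust. Pickled scikit-learn models are also not guaranteed to load across different library versions.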
If you use these notebooks or models, please cite:
Constantinou, C., & Nikita, E. (2022). SexEst: An open access web application for metric skeletal sex estimation. International Journal of Osteoarchaeology, 32(4), 832-844. https://doi.org/10.1002/oa.3109
```bibtex
@article{constantinou2022sexest,
  title={SexEst: An open access web application for metric skeletal sex estimation},
  author={Constantinou, Chrysovalantis and Nikita, Efthymia},
  journal={International Journal of Osteoarchaeology},
  volume={32},
  number={4},
  pages={832--844},
  year={2022},
  publisher={Wiley Online Library}
}
```

| Resource | URL |
|---|---|
| Paper | https://doi.org/10.1002/oa.3109 |
| Live demo | http://sexest.cyi.ac.cy/ |
| Web app repository | https://github.com/cconsta1/SexEst.git |
| Training notebooks | https://github.com/cconsta1/SexEst_Notebooks.git |
| Original datasets | https://web.utk.edu/~auerbach/DATA.htm |
Apache License 2.0. See LICENSE.