SexEst_Notebooks

Jupyter notebooks for training machine learning models that predict biological sex from skeletal measurements. These notebooks produce the models deployed in the SexEst web application.

Overview

Skeletal sex estimation is essential in osteoarchaeology and forensic anthropology. This project evaluates multiple machine learning classifiers on worldwide cranial and postcranial measurements and deploys the best-performing models (XGBoost, LightGBM, Linear Discriminant Analysis) in a free, open-source web application.

Notebooks

| Notebook | Dataset | Measurements | Sample size |
| --- | --- | --- | --- |
| `howell_dataset_analysis.ipynb` | Howells craniometric | 32 cranial dimensions (GOL, NOL, BNL, BBH, XCB, etc.) | >2,500 individuals |
| `goldman_dataset_analysis.ipynb` | Goldman osteometric | 11 postcranial dimensions (BIB, HML, HHD, FML, TML, etc.) | ~1,500 individuals |

What the notebooks do

1. Load data: import CSV files from the `datasets/` directory
2. Preprocess: handle missing values (KNN imputation, iterative imputation), average left/right measurements, encode sex labels
3. Feature selection: select measurement subsets for modelling
4. Model training: train and compare 14 classifiers:
   - Logistic Regression, Decision Tree, SVM, Gaussian Process
   - Random Forest, AdaBoost, Gradient Boosting, Extra Trees
   - Gaussian Naive Bayes, k-Nearest Neighbors
   - Linear/Quadratic Discriminant Analysis
   - XGBoost, LightGBM
5. Hyperparameter tuning: grid search, randomized search, Bayesian optimization (scikit-optimize)
6. Export models: save trained models as `.dat` files (pickle) for use in the SexEst app
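The preprocessing and model-comparison steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebooks' actual code: the column names (`HML_L`, `HML_R`, `FML_L`, `FML_R`, `Sex`), the `F`/`M` label coding, and the choice of three classifiers are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real CSV; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
sex = rng.choice(["F", "M"], size=n)
shift = (sex == "M").astype(float)
side_cols = ["HML_L", "HML_R", "FML_L", "FML_R"]
df = pd.DataFrame({
    "Sex": sex,
    "HML_L": 300 + 15 * shift + rng.normal(0, 10, n),
    "HML_R": 302 + 15 * shift + rng.normal(0, 10, n),
    "FML_L": 420 + 20 * shift + rng.normal(0, 15, n),
    "FML_R": 421 + 20 * shift + rng.normal(0, 15, n),
})

# Blank out roughly 10% of the measurements to mimic missing values.
holes = rng.random((n, len(side_cols))) < 0.10
df[side_cols] = df[side_cols].mask(holes)

# Average left/right sides; a row stays NaN only if both sides are missing.
features = pd.DataFrame({
    "HML": df[["HML_L", "HML_R"]].mean(axis=1),
    "FML": df[["FML_L", "FML_R"]].mean(axis=1),
})

# Encode sex labels as integers (assumed coding: F=0, M=1).
y = df["Sex"].map({"F": 0, "M": 1}).to_numpy()

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# KNN imputation and scaling run inside the pipeline, so each
# cross-validation fold is imputed using its own training rows only.
scores = {}
for name, clf in classifiers.items():
    pipe = make_pipeline(KNNImputer(n_neighbors=5), StandardScaler(), clf)
    scores[name] = cross_val_score(pipe, features, y, cv=5).mean()

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} mean CV accuracy = {score:.3f}")
```

Placing the imputer inside the pipeline is a design choice of this sketch that avoids leaking test-fold information; the notebooks may impute at a different stage.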

Requirements

Python packages

pandas
numpy
scipy
matplotlib
seaborn
scikit-learn
xgboost
lightgbm
scikit-optimize
joblib

Environment setup

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows

# Install dependencies
pip install pandas numpy scipy matplotlib seaborn scikit-learn xgboost lightgbm scikit-optimize joblib

# Or with conda
conda create -n sexest python=3.10
conda activate sexest
conda install pandas numpy scipy matplotlib seaborn scikit-learn xgboost lightgbm
pip install scikit-optimize
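After installation, a quick check such as the one below can confirm that every listed package is visible to the active environment. It is a small sketch using only the standard library; the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the install command above.
REQUIRED = [
    "pandas", "numpy", "scipy", "matplotlib", "seaborn",
    "scikit-learn", "xgboost", "lightgbm", "scikit-optimize", "joblib",
]


def installed_versions(packages=REQUIRED):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:16s} {ver or 'MISSING'}")
```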

Data files

The notebooks require CSV datasets that are NOT included in this repository. See DATA_AVAILABILITY.md for download instructions.

| File | Description | Download |
| --- | --- | --- |
| `datasets/Howell.csv` | Howells craniometric training data | Auerbach DATA page (https://web.utk.edu/~auerbach/DATA.htm) |
| `datasets/HowellTest.csv` | Howells craniometric test partition | Same as above |
| `datasets/Goldman.csv` | Goldman osteometric data | Same as above |
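Because the CSV files are downloaded separately, it can help to confirm they are in place before opening a notebook. The sketch below assumes the file names from the table above and a `datasets/` directory relative to the current working directory; `missing_datasets` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# File names taken from the table above.
EXPECTED = ["Howell.csv", "HowellTest.csv", "Goldman.csv"]


def missing_datasets(datasets_dir="datasets"):
    """Return the expected CSV files that are not present in datasets_dir."""
    root = Path(datasets_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_datasets()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All dataset files found.")
```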

Running the notebooks

1. Download the datasets and place them in the `datasets/` directory.
2. Open the notebooks in Jupyter or VS Code.
3. Run the cells sequentially; the notebooks are designed to be executed top to bottom.
4. Trained models are saved as `.dat` files in the working directory.

# Start Jupyter
jupyter notebook

# Or in VS Code, open .ipynb files directly
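For unattended runs, a notebook can also be executed top to bottom without opening the browser interface. The sketch below wraps the `jupyter nbconvert --to notebook --execute` command; it assumes `nbconvert` is installed alongside Jupyter, and the function names and the output file name are illustrative only.

```python
import subprocess


def nbconvert_command(notebook, output=None):
    """Build the command that executes a notebook and writes the executed copy."""
    cmd = ["jupyter", "nbconvert", "--to", "notebook", "--execute", notebook]
    if output:
        cmd += ["--output", output]
    return cmd


def run_notebook(notebook, output=None):
    """Execute the notebook; raises CalledProcessError if any cell fails."""
    subprocess.run(nbconvert_command(notebook, output), check=True)


if __name__ == "__main__":
    # Print the command instead of running it; call run_notebook() to execute.
    print(" ".join(nbconvert_command("goldman_dataset_analysis.ipynb",
                                     "goldman_executed.ipynb")))
```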

Output

The notebooks export trained models as pickle files:

lda_model_*.dat — Linear Discriminant Analysis
xgb_model_*.dat — XGBoost
lgb_model_*.dat — LightGBM

These models are used by the SexEst Streamlit app.
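Since the exported files are pickles, they can be read back with the standard `pickle` module. The following is a minimal sketch: `load_models` and its glob pattern are assumptions for illustration, and the demo pickles a plain dictionary as a stand-in rather than a real model. Loading the actual files requires the library that produced each model (scikit-learn, XGBoost, or LightGBM) to be installed, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path


def load_models(model_dir=".", pattern="*_model_*.dat"):
    """Unpickle every matching file and return {filename: model}."""
    models = {}
    for path in sorted(Path(model_dir).glob(pattern)):
        with path.open("rb") as fh:
            models[path.name] = pickle.load(fh)
    return models


if __name__ == "__main__":
    # Round-trip a stand-in object to show the mechanics.
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = Path(tmp) / "lda_model_demo.dat"
        with demo_path.open("wb") as fh:
            pickle.dump({"kind": "stand-in for a trained model"}, fh)
        for name, model in load_models(tmp).items():
            print(name, "->", type(model).__name__)
```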

Citation

If you use these notebooks or models, please cite:

Constantinou, C., & Nikita, E. (2022). SexEst: An open access web application for metric skeletal sex estimation. International Journal of Osteoarchaeology, 32(4), 832-844. https://doi.org/10.1002/oa.3109

@article{constantinou2022sexest,
  title={SexEst: An open access web application for metric skeletal sex estimation},
  author={Constantinou, Chrysovalantis and Nikita, Efthymia},
  journal={International Journal of Osteoarchaeology},
  volume={32},
  number={4},
  pages={832--844},
  year={2022},
  publisher={Wiley Online Library}
}

Links

| Resource | URL |
| --- | --- |
| Paper | https://doi.org/10.1002/oa.3109 |
| Live demo | http://sexest.cyi.ac.cy/ |
| Web app repository | https://github.com/cconsta1/SexEst.git |
| Training notebooks | https://github.com/cconsta1/SexEst_Notebooks.git |
| Original datasets | https://web.utk.edu/~auerbach/DATA.htm |

License

Apache License 2.0. See LICENSE.
