This notebook explores the actigraphy time series recordings to get an idea of the data at hand. The goals are to:
- Investigate data quality (e.g., time gaps, non-wear periods, battery issues)
- Calculate main statistics for all participants
- Derive meaningful insights for feature engineering (circadian rhythms, time-based activity trends, activity levels, etc.)
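As a rough illustration of the data-quality bullet, time gaps can be flagged by comparing consecutive timestamps against the expected sampling interval. The sketch below is plain Python on timestamps given in seconds. The function name `find_time_gaps`, the 5-second expected step, and the tolerance are assumptions made for illustration, not properties taken from the dataset.

```python
def find_time_gaps(timestamps, expected_step=5.0, tolerance=0.5):
    """Return (start, end, duration) for every gap longer than expected.

    timestamps: sorted sequence of sample times in seconds.
    expected_step and tolerance are illustrative assumptions.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if delta > expected_step + tolerance:
            gaps.append((prev, curr, delta))
    return gaps


# Hypothetical recording with one missing stretch between t=10 and t=40.
sample_times = [0, 5, 10, 40, 45, 50]
gaps = find_time_gaps(sample_times)
```

The same comparison could be applied per participant, and long gaps could then be inspected separately from non-wear periods.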
The aim of this competition is to predict the Severity Impairment Index (sii), which measures the level of problematic internet use among children and adolescents, based on physical activity data and other features. sii is derived from PCIAT-PCIAT_Total, the sum of scores from the Parent-Child Internet Addiction Test (PCIAT: 20 questions, scored 0-5).
The target variable (sii) is defined as:
- 0: None (PCIAT-PCIAT_Total from 0 to 30)
- 1: Mild (PCIAT-PCIAT_Total from 31 to 49)
- 2: Moderate (PCIAT-PCIAT_Total from 50 to 79)
- 3: Severe (PCIAT-PCIAT_Total from 80 to 100)
This makes sii an ordinal categorical variable with four levels, where the order of categories is meaningful.
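The category boundaries above translate directly into a lookup on the lower bounds of the Mild, Moderate and Severe ranges. This is a minimal sketch using the standard-library `bisect` module; the names `SII_LOWER_BOUNDS` and `pciat_total_to_sii` are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the Mild, Moderate and Severe ranges listed above.
SII_LOWER_BOUNDS = [31, 50, 80]


def pciat_total_to_sii(total):
    """Map a PCIAT-PCIAT_Total score to its sii category (0-3)."""
    # bisect_right counts how many lower bounds are <= total.
    return bisect_right(SII_LOWER_BOUNDS, total)


# Check the values on both sides of every boundary.
categories = [pciat_total_to_sii(t) for t in (0, 30, 31, 49, 50, 79, 80, 100)]
```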
Types of machine learning problems we can frame with sii as the target:
- Ordinal classification (ordinal logistic regression, models with custom ordinal loss functions)
- Multiclass classification (treat sii as a nominal categorical variable without considering the order)
- Regression (ignore the discrete nature of the categories and treat sii as a continuous variable, then round the predictions)
- Custom (e.g. loss functions that penalize errors based on the distance between categories)
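To make the last option concrete, one simple distance-aware penalty is the squared difference between the predicted and the true category, so that predicting 3 for a true 0 costs nine times as much as predicting 1. The sketch is plain Python; the function name and the choice of a squared penalty are assumptions for illustration, not the competition's scoring rule.

```python
def mean_squared_category_distance(y_true, y_pred):
    """Average squared distance between true and predicted categories."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 1, 2, 3]
# One error of distance 1 (true 0 predicted as 1).
near_miss = mean_squared_category_distance(y_true, [1, 1, 2, 3])
# One error of distance 3 (true 0 predicted as 3).
far_miss = mean_squared_category_distance(y_true, [3, 1, 2, 3])
```

A plain multiclass loss would score both mistakes identically, which is the information an ordinal-aware penalty tries to keep.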
We can also treat PCIAT-PCIAT_Total as a continuous target: fit a regression model on it, then map the predictions to sii categories.
Finally, another strategy is to predict the response to each question of the Parent-Child Internet Addiction Test: predict the individual question scores as separate targets, sum the predicted scores to get PCIAT-PCIAT_Total, and map that total to the corresponding sii category.
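The per-question strategy can be sketched as follows, assuming some model has already produced one continuous prediction per question. The helper names are hypothetical, and clipping each prediction to the 0-5 range before summing is an assumption made here, not a step stated above.

```python
from bisect import bisect_right

SII_LOWER_BOUNDS = [31, 50, 80]  # lower bounds of Mild, Moderate, Severe
N_QUESTIONS = 20                 # PCIAT has 20 questions, each scored 0-5


def question_scores_to_sii(predicted_scores):
    """Sum per-question predictions and map the total to an sii category."""
    if len(predicted_scores) != N_QUESTIONS:
        raise ValueError(f"expected {N_QUESTIONS} question scores")
    # Assumption: keep each predicted score inside the valid 0-5 range.
    clipped = [min(5.0, max(0.0, s)) for s in predicted_scores]
    total = sum(clipped)
    # A fractional total below 31 stays in category 0, and so on.
    return total, bisect_right(SII_LOWER_BOUNDS, total)


# Hypothetical model output for one participant.
predicted = [2.4] * 10 + [1.1] * 10
total, sii = question_scores_to_sii(predicted)
```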
- a first analysis of the data
- how to cross-validate a model
- that regression models are better than classification models in this competition, and
- how to tune the thresholds for rounding the regression output.
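Threshold tuning for the rounded regression output can be sketched as a search over three cut points that split a continuous prediction into the four sii categories. The example is plain Python with a coarse grid search; the function names, the grid, the toy data, and the use of accuracy as the score are assumptions chosen to keep the sketch short, and any other score function can be passed in instead.

```python
from bisect import bisect_right
from itertools import combinations


def apply_thresholds(predictions, thresholds):
    """Convert continuous predictions to categories 0..len(thresholds)."""
    return [bisect_right(thresholds, p) for p in predictions]


def accuracy(y_true, y_pred):
    """Fraction of exact matches (placeholder score for this sketch)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def tune_thresholds(y_true, predictions, grid, score=accuracy):
    """Try every increasing triple of grid values and keep the best one."""
    best_thresholds, best_score = None, float("-inf")
    for candidate in combinations(sorted(grid), 3):
        categories = apply_thresholds(predictions, list(candidate))
        current = score(y_true, categories)
        if current > best_score:
            best_thresholds, best_score = list(candidate), current
    return best_thresholds, best_score


# Hypothetical regression outputs that are biased low relative to the labels.
y_true = [0, 0, 1, 1, 2, 3]
predictions = [0.1, 0.3, 0.45, 0.6, 1.2, 1.9]
grid = [0.4, 0.5, 1.0, 1.5, 2.5]

# Plain rounding corresponds to cut points at 0.5, 1.5 and 2.5.
default_score = accuracy(y_true, apply_thresholds(predictions, [0.5, 1.5, 2.5]))
best_thresholds, best_score = tune_thresholds(y_true, predictions, grid)
```

Tuning and scoring on the same rows, as this toy example does, gives an optimistic result; with real data the thresholds would more reasonably be tuned on out-of-fold predictions from cross-validation.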
The notebook uses polars DataFrames. If you are more fluent with pandas than with polars, this is an opportunity to get to know polars, which is often more efficient than pandas.
voted ensemble consisting of: (improving the robustness)