When signaficant amount of data in highly-important features are missing, what can we do? Impute the missing data with mean or median? In this Juyter notebook, I demonstrate embedding a XGBoost model to do the data imputation in the data transformer.
In this dataset, a lot of "cost" data missing, but they are quite important to predict "price".
If we impute the missing "cost" with its mean or median, there will be a spike in the imputed dataset. In contrast, imputing the missing "cost" with a XGBoost regressor which is embedded in the data transformer and predicts "cost" from other features is very effective.