XgBoost as nuisance model in DML learners (categorical features) #907

erasedcitizen11 · 2024-08-05T14:58:49Z

Hi there,

I'm looking to use xgboost as my nuisance model in my DoubleML setup and use xgboost's own mechanism for encoding categorical features (rather than having to one hot encode them myself).

I can do this quite easily in xgboost using the scikit-learn interface:

from econml.dml import NonParamDML
import numpy as np
import scipy
import pandas as pd
import xgboost as xgb
np.random.seed(123)

#numeric feats
X_num = pd.DataFrame(np.random.normal(size=(1000, 2)))

#categorical feat
X_cat = pd.Series(np.random.randint(0,3,1000))
X_cat=X_cat.replace({0:'zero',1:'one',2:'two'}).astype('category')
X=pd.concat([X_num,X_cat],axis=1)
X.columns=['num_feat_1','num_feat_2','cat_feat_1']

T = pd.Series(np.random.normal(size=(1000)))
y = pd.Series(np.random.randint(0,2,1000))


training_params={
    'objective': 'binary:logistic',
    'booster': 'gbtree',      
    'eval_metric': 'logloss',
    'enable_categorical':True,
    'max_cat_to_onehot':2,
    'learning_rate':0.1,
    'max_depth': 6
}

clf = xgb.XGBClassifier(**training_params)
clf.fit(X,y)
clf.predict_proba(X)

However, using the DML classes results in an error. The parameter that enables xgboost's own encoding of categorical features is enable_categorical=True, but it appears econml is not able to pass this information on to xgboost.

Is there a way to get around this? I have some high cardinality categorical features that I would rather not have to one hot encode, hence why I have used xgboost as my nuisance model.

est = NonParamDML(
    model_y=xgb.XGBClassifier(**training_params),
    model_t=xgb.XGBRegressor(),
    model_final=xgb.XGBRegressor(),
    discrete_treatment=False,
    discrete_outcome=True,
    allow_missing=True,
)
est.fit(y, T, X=X, W=None)


kbattocchi · 2024-08-07T14:48:10Z

What error are you getting? We should be using the model as-is however you instantiate it.

One issue that you might run into is that we internally do cross-fitting with the model. If you have rare values in that column, it's possible that an instance fit on a subset of the data never sees one of those values but then encounters it when predicting on the held-out set, leading to an error. If that's the issue, you can provide your own cross-fitting folds through the cv argument, stratified on both the values of that column and the outcome, rather than using the default (which will otherwise stratify only on the outcome, since that's discrete).
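A minimal sketch of the custom-fold idea described above, assuming that the cv argument accepts an iterable of (train, test) index pairs in the usual scikit-learn style. The helper name stratified_folds and the synthetic data are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def stratified_folds(cat_codes, y, n_splits=2, seed=123):
    """Return (train_idx, test_idx) pairs stratified on (category, outcome)."""
    cat_codes = np.asarray(cat_codes)
    y = np.asarray(y)
    # Map every distinct (category, outcome) pair to one integer stratum label.
    pairs = np.column_stack([cat_codes, y])
    _, strata = np.unique(pairs, axis=0, return_inverse=True)
    strata = strata.ravel()
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # The feature matrix is irrelevant for splitting, so a placeholder is enough.
    return list(skf.split(np.zeros(len(y)), strata))


# Synthetic stand-ins for the categorical column and the binary outcome.
rng = np.random.default_rng(123)
cat_codes = rng.integers(0, 3, 1000)
y = rng.integers(0, 2, 1000)

folds = stratified_folds(cat_codes, y)
# Assumed usage: NonParamDML(..., cv=folds)
```

Each stratum needs at least n_splits members for the split to succeed, so extremely rare categories may still have to be grouped or dropped beforehand.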

erasedcitizen11 · 2024-08-07T21:56:59Z

Thanks for the response. In the example above, I get:
ValueError: could not convert string to float: 'zero'

So xgboost thinks this feature is numeric when it is categorical when I call NonParamDML (with the xgboost learners as my models). But it all works fine when I train an xgboost model directly.

fverac · 2024-09-05T20:39:23Z

This is a late follow-up, but I spent some time digging into this and can offer a solution. Hopefully this helps others who run into the same problem.

The crux of the problem is two-fold.

1. During fit-time, EconML converts input arrays (i.e. X, Y, T) to numpy objects (even if they were passed in as pandas dataframes).
   a. See this line: https://github.com/py-why/EconML/blob/main/econml/_ortho_learner.py#L744

This leads to a problem when we try to fit our XGBoost model internally within EconML because:

2. XGBoost's enable_categorical feature only handles string features when the input matrices are dataframes (from e.g. pandas, Modin, or cuDF) and the string features are set to be of categorical type. In all other cases, you'll need to preprocess/encode your categorical features.
   a. See the XGBoost documentation for more details: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix

So, your demo code works because it passes pandas dataframes directly to xgboost's fit method. It doesn't work in EconML because EconML internally passes numpy arrays to xgboost's fit method, which raises an error if the categorical features are passed as strings.
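To illustrate why the categorical information is lost, here is a small sketch using only pandas and numpy. It is not EconML's actual conversion code; it only shows what happens to a categorical column when a dataframe is turned into a plain array.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "num_feat_1": [0.1, 0.2, 0.3],
    "cat_feat_1": pd.Series(["zero", "one", "two"], dtype="category"),
})

# The dataframe still knows that cat_feat_1 is categorical.
print(X.dtypes)

# After conversion, the array has a single dtype and the category
# metadata is gone; the strings are just Python objects.
arr = X.to_numpy()
print(arr.dtype)

# Treating the array as numeric fails on the string values.
conversion_error = None
try:
    arr.astype(float)
except ValueError as err:
    conversion_error = err
print(conversion_error)
```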

To demonstrate, your xgboost demo code throws an error if instead of passing pandas dataframes you pass the numpy values directly.

from econml.dml import NonParamDML
import numpy as np
import scipy
import pandas as pd
import xgboost as xgb
np.random.seed(123)

#numeric feats
X_num = pd.DataFrame(np.random.normal(size=(1000, 2)))

#categorical feat
X_cat = pd.Series(np.random.randint(0,3,1000))
X_cat=X_cat.replace({0:'zero',1:'one',2:'two'}).astype('category')
X=pd.concat([X_num,X_cat],axis=1)
X.columns=['num_feat_1','num_feat_2','cat_feat_1']

T = pd.Series(np.random.normal(size=(1000)))
y = pd.Series(np.random.randint(0,2,1000))


training_params={
    'objective': 'binary:logistic',
    'booster': 'gbtree',      
    'eval_metric': 'logloss',
    'enable_categorical':True,
    'max_cat_to_onehot':2,
    'learning_rate':0.1,
    'max_depth': 6
}

clf = xgb.XGBClassifier(**training_params)

# I changed these lines to pass numpy arrays instead of pandas dataframes, and I get a ValueError
clf.fit(X.values,y.values)
clf.predict_proba(X.values)

However, you can make it work again by:

1. Passing the feature types directly to the xgboost estimator (via the feature_types parameter, as in the code below), and
2. Ensuring the categorical variable is numerical but "label encoded". I achieve this by commenting out the line where you convert the categorical feature to a string.

from econml.dml import NonParamDML
import numpy as np
import scipy
import pandas as pd
import xgboost as xgb
np.random.seed(123)

#numeric feats
X_num = pd.DataFrame(np.random.normal(size=(1000, 2)))

#categorical feat
X_cat = pd.Series(np.random.randint(0,3,1000))

# do not run this line, so that the categorical feature remains an int column
# X_cat=X_cat.replace({0:'zero',1:'one',2:'two'}).astype('category')

X=pd.concat([X_num,X_cat],axis=1)
X.columns=['num_feat_1','num_feat_2','cat_feat_1']

T = pd.Series(np.random.normal(size=(1000)))
y = pd.Series(np.random.randint(0,2,1000))


training_params={
    'objective': 'binary:logistic',
    'booster': 'gbtree',      
    'eval_metric': 'logloss',
    'enable_categorical':True,
    'max_cat_to_onehot':2,
    'learning_rate':0.1,
    'max_depth': 6,
    'feature_types': ['q', 'q', 'c'] # specify which features are numerical vs categorical
}

clf = xgb.XGBClassifier(**training_params)

# Still passing np arrays. This works again.
clf.fit(X.values,y.values)
clf.predict_proba(X.values)
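If the categorical column already holds strings, one possible way to get the label-encoded form and the matching feature_types list is sketched below. The helper name label_encode is hypothetical, and the sketch assumes the column already has pandas categorical dtype.

```python
import numpy as np
import pandas as pd


def label_encode(df):
    """Return (float array, feature_types) with categorical columns as integer codes."""
    encoded = df.copy()
    feature_types = []
    for col in encoded.columns:
        if isinstance(encoded[col].dtype, pd.CategoricalDtype):
            # .cat.codes gives one integer per category; missing values
            # become -1 and would need separate handling.
            encoded[col] = encoded[col].cat.codes
            feature_types.append("c")
        else:
            feature_types.append("q")
    return encoded.to_numpy(dtype=float), feature_types


X = pd.DataFrame({
    "num_feat_1": [0.5, -1.2, 0.3],
    "num_feat_2": [1.0, 0.0, 2.5],
    "cat_feat_1": pd.Series(["zero", "one", "two"], dtype="category"),
})

X_encoded, feature_types = label_encode(X)
print(feature_types)
print(X_encoded)
```

The resulting X_encoded can then be passed in place of X.values, with feature_types supplied to the xgboost estimators as in the example above.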

Finally, applying this to your EconML sample code (with X built as in the previous snippet, so that cat_feat_1 holds integer codes rather than strings):

training_params={
    'objective': 'binary:logistic',
    'booster': 'gbtree',      
    'eval_metric': 'logloss',
    'enable_categorical':True,
    'max_cat_to_onehot':2,
    'learning_rate':0.1,
    'max_depth': 6,
    'feature_types': ['q', 'q', 'c'] # explicitly label which features are categorical vs numerical
}

est = NonParamDML(
    model_y=xgb.XGBClassifier(**training_params),
    model_t=xgb.XGBRegressor(enable_categorical=True, feature_types=['q', 'q', 'c']),
    model_final=xgb.XGBRegressor(enable_categorical=True, feature_types=['q', 'q', 'c']),
    discrete_treatment=False,
    discrete_outcome=True,
    allow_missing=True,
)
est.fit(y, T, X=X, W=None)
