
XgBoost as nuisance model in DML learners (categorical features) #907

Open
erasedcitizen11 opened this issue Aug 5, 2024 · 3 comments

erasedcitizen11 commented Aug 5, 2024

Hi there,

I'm looking to use xgboost as my nuisance model in my DoubleML setup and use xgboost's own mechanism for encoding categorical features (rather than having to one hot encode them myself).

I can do this quite easily in xgboost using the scikit-learn interface:

from econml.dml import NonParamDML
import numpy as np
import scipy
import pandas as pd
import xgboost as xgb
np.random.seed(123)

# numeric features
X_num = pd.DataFrame(np.random.normal(size=(1000, 2)))

# categorical feature
X_cat = pd.Series(np.random.randint(0, 3, 1000))
X_cat = X_cat.replace({0: 'zero', 1: 'one', 2: 'two'}).astype('category')
X = pd.concat([X_num, X_cat], axis=1)
X.columns = ['num_feat_1', 'num_feat_2', 'cat_feat_1']

T = pd.Series(np.random.normal(size=1000))
y = pd.Series(np.random.randint(0, 2, 1000))


training_params = {
    'objective': 'binary:logistic',
    'booster': 'gbtree',
    'eval_metric': 'logloss',
    'enable_categorical': True,
    'max_cat_to_onehot': 2,
    'learning_rate': 0.1,
    'max_depth': 6
}

clf = xgb.XGBClassifier(**training_params)
clf.fit(X, y)
clf.predict_proba(X)

However, using the DML classes results in an error. The parameter that enables xgboost's own encoding of categorical features is enable_categorical=True, but it appears econml is not able to pass this information on to xgboost.

Is there a way around this? I have some high-cardinality categorical features that I would rather not one-hot encode, which is why I chose xgboost as my nuisance model.

est = NonParamDML(
    model_y=xgb.XGBClassifier(**training_params),
    model_t=xgb.XGBRegressor(),
    model_final=xgb.XGBRegressor(),
    discrete_treatment=False,
    discrete_outcome=True,
    allow_missing=True,
)
est.fit(y, T, X=X, W=None)

kbattocchi (Collaborator) commented

What error are you getting? We should be using the model as-is, however you instantiate it.

One issue you might run into is that we perform cross-fitting with the model internally, so if that column has rare values, it's possible that an instance fit on a subset of the data never sees one of those values but then encounters it when predicting on the held-out set, leading to an error. If that's the issue, you can provide your own cross-fitting folds through the cv argument, stratifying on the values of that column as well as the outcome, rather than using the default (which will otherwise stratify only on the outcome, since it's discrete).
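
For illustration, a minimal sketch of supplying such folds, reusing the variable names from the code above (the combined stratification key is just one simple way to build the grouping; it is not part of the econml API):

from sklearn.model_selection import StratifiedKFold

# Stratify on both the categorical column and the discrete outcome so each
# training fold sees every (category, outcome) combination; note that
# StratifiedKFold needs at least n_splits rows per combined class.
strat_key = X['cat_feat_1'].astype(str) + '_' + y.astype(str)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
folds = list(skf.split(X, strat_key))  # iterable of (train, test) index arrays

est = NonParamDML(
    model_y=xgb.XGBClassifier(**training_params),
    model_t=xgb.XGBRegressor(),
    model_final=xgb.XGBRegressor(),
    discrete_treatment=False,
    discrete_outcome=True,
    cv=folds,  # custom cross-fitting folds instead of the default splitter
)
est.fit(y, T, X=X, W=None)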

erasedcitizen11 (Author) commented Aug 7, 2024

Thanks for the response. In the example above, I get:

ValueError: could not convert string to float: 'zero'

So when I call NonParamDML (with the xgboost learners as my models), xgboost treats the feature as numeric even though it is categorical, whereas everything works fine when I train an xgboost model directly.

fverac (Collaborator) commented Sep 5, 2024

This is a late follow-up, but I spent some time digging into this and can offer a solution. Hopefully it helps others who run into the same problem.

The crux of the problem is two-fold:

  1. At fit time, EconML converts the input arrays (i.e. X, Y, T) to numpy objects, even if they were passed in as pandas dataframes. See this line: https://github.com/py-why/EconML/blob/main/econml/_ortho_learner.py#L744
  2. XGBoost's enable_categorical feature only handles string features when the input matrices are dataframes (from e.g. pandas, Modin, or cuDF) and the string columns are typed as categorical. In all other cases, you need to preprocess/encode your categorical features yourself. See the XGBoost documentation for more details: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.DMatrix

So your demo code works because it passes pandas dataframes directly to xgboost's fit method, but the same models fail inside EconML because EconML internally passes numpy arrays to xgboost's fit method, which raises an error when the categorical features arrive as strings.

To demonstrate, your xgboost demo code throws an error if you pass the numpy values instead of the pandas dataframes:

from econml.dml import NonParamDML
import numpy as np
import scipy
import pandas as pd
import xgboost as xgb
np.random.seed(123)

# numeric features
X_num = pd.DataFrame(np.random.normal(size=(1000, 2)))

# categorical feature
X_cat = pd.Series(np.random.randint(0, 3, 1000))
X_cat = X_cat.replace({0: 'zero', 1: 'one', 2: 'two'}).astype('category')
X = pd.concat([X_num, X_cat], axis=1)
X.columns = ['num_feat_1', 'num_feat_2', 'cat_feat_1']

T = pd.Series(np.random.normal(size=1000))
y = pd.Series(np.random.randint(0, 2, 1000))


training_params = {
    'objective': 'binary:logistic',
    'booster': 'gbtree',
    'eval_metric': 'logloss',
    'enable_categorical': True,
    'max_cat_to_onehot': 2,
    'learning_rate': 0.1,
    'max_depth': 6
}

clf = xgb.XGBClassifier(**training_params)

# I change these lines to pass numpy arrays instead of pandas dataframes, and I get a ValueError
clf.fit(X.values, y.values)
clf.predict_proba(X.values)

However, you can make it work again by:

  1. Passing the feature types explicitly to xgboost via its feature_types parameter, and
  2. Ensuring the categorical variable is numeric but "label encoded". I achieve this below by commenting out the line that converts the categorical feature to strings.
from econml.dml import NonParamDML
import numpy as np
import scipy
import pandas as pd
import xgboost as xgb
np.random.seed(123)

# numeric features
X_num = pd.DataFrame(np.random.normal(size=(1000, 2)))

# categorical feature
X_cat = pd.Series(np.random.randint(0, 3, 1000))

# do not run this line, so that the categorical feature remains as ints
# X_cat = X_cat.replace({0: 'zero', 1: 'one', 2: 'two'}).astype('category')

X = pd.concat([X_num, X_cat], axis=1)
X.columns = ['num_feat_1', 'num_feat_2', 'cat_feat_1']

T = pd.Series(np.random.normal(size=1000))
y = pd.Series(np.random.randint(0, 2, 1000))


training_params = {
    'objective': 'binary:logistic',
    'booster': 'gbtree',
    'eval_metric': 'logloss',
    'enable_categorical': True,
    'max_cat_to_onehot': 2,
    'learning_rate': 0.1,
    'max_depth': 6,
    'feature_types': ['q', 'q', 'c']  # 'q' = quantitative (numeric), 'c' = categorical
}

clf = xgb.XGBClassifier(**training_params)

# Still passing numpy arrays; this works again.
clf.fit(X.values, y.values)
clf.predict_proba(X.values)

Finally, reapplying this back to your sample code for EconML:

training_params = {
    'objective': 'binary:logistic',
    'booster': 'gbtree',
    'eval_metric': 'logloss',
    'enable_categorical': True,
    'max_cat_to_onehot': 2,
    'learning_rate': 0.1,
    'max_depth': 6,
    'feature_types': ['q', 'q', 'c']  # explicitly label which features are categorical vs numerical
}

est = NonParamDML(
    model_y=xgb.XGBClassifier(**training_params),
    model_t=xgb.XGBRegressor(enable_categorical=True, feature_types=['q', 'q', 'c']),
    model_final=xgb.XGBRegressor(enable_categorical=True, feature_types=['q', 'q', 'c']),
    discrete_treatment=False,
    discrete_outcome=True,
    allow_missing=True,
)
est.fit(y, T, X=X, W=None)
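
A final caveat for anyone whose categorical column holds strings, as in the original example: the feature_types route requires the column to already be numeric codes, so a label-encoding step is needed first. A minimal sketch using sklearn's OrdinalEncoder (illustrative only; any consistent integer encoding works), assuming X is the original dataframe with string categories:

from sklearn.preprocessing import OrdinalEncoder

# Map the string categories to integer codes so the numpy arrays that EconML
# passes to xgboost stay numeric; feature_types=['q', 'q', 'c'] still tells
# xgboost to treat the column as categorical.
enc = OrdinalEncoder()
X_encoded = X.copy()
X_encoded[['cat_feat_1']] = enc.fit_transform(X[['cat_feat_1']].astype(str))

est.fit(y, T, X=X_encoded, W=None)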
