Try XGBoost. #167

Open
dchaplinsky opened this issue Aug 11, 2024 · 0 comments

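import xgboost as xgb
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# read_pairs, pairs_to_arrays, Judgement, PathLike and log are assumed to be
# in scope from the project's existing v1 training code.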


def train_matcher(pairs_file: PathLike) -> None:
    pairs = []
    for pair in read_pairs(pairs_file):
        if pair.judgement == Judgement.UNSURE:
            pair.judgement = Judgement.NEGATIVE
        pairs.append(pair)

    positive = len([p for p in pairs if p.judgement == Judgement.POSITIVE])
    negative = len([p for p in pairs if p.judgement == Judgement.NEGATIVE])
    log.info("Total pairs loaded: %d (%d pos/%d neg)", len(pairs), positive, negative)

    X, y = pairs_to_arrays(pairs)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    log.info("Training model with XGBoost...")

    model = xgb.XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    cnf_matrix = confusion_matrix(y_test, y_pred)
    print("Confusion matrix:\n", cnf_matrix)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))

    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    print("Area under curve:", auc)
    model.save_model("/tmp/xgboost_v1.ubj")

The previous code for v1 gave:

Confusion matrix:
 [[ 19393  15738]
 [  7085 111833]]
Accuracy: 0.851845841258301
Precision: 0.8766334041435749
Recall: 0.9404211305269177
Area under curve: 0.8888057796734737

while the same features and data with XGBoost give:

Confusion matrix:
 [[ 27154   8261]
 [  5642 112992]]
Accuracy: 0.9097494952904596
Precision: 0.9318697269345914
Recall: 0.9524419643609757
Area under curve: 0.9312480095583613
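
As a sanity check on the two reports above, accuracy, precision and recall can be recomputed directly from the confusion matrices. This is a minimal sketch, not part of the training code: the helper name `metrics_from_confusion` is made up for illustration, and it assumes the scikit-learn layout for a binary problem, `[[TN, FP], [FN, TP]]`.

```python
def metrics_from_confusion(cm):
    """Recompute accuracy, precision and recall from a 2x2 confusion matrix.

    Assumes the scikit-learn layout: rows are true labels, columns are
    predicted labels, so cm == [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Confusion matrices copied from the two runs reported above.
V1 = [[19393, 15738], [7085, 111833]]
XGB = [[27154, 8261], [5642, 112992]]

for name, cm in (("v1", V1), ("xgboost", XGB)):
    metrics = metrics_from_confusion(cm)
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Most of the gain is in the negative class: false positives drop from 15738 to 8261, while false negatives drop from 7085 to 5642. Note that `train_test_split` is called without `random_state`, so the two runs were scored on different random splits (35131 vs 35415 negatives in the test set); passing a fixed seed would make the comparison exactly like-for-like.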