Conversation

@YuzeHao2023
- Implement LightGBMClassifier in sml/ensemble/lightgbm.py
  - Supports gradient boosting with decision trees
  - Parameters: n_estimators, learning_rate, max_depth, num_leaves, criterion, epsilon
  - Methods: fit(), predict(), decision_function()
  - Consistent with AdaBoost and RandomForest implementations

- Add comprehensive tests in tests/ensemble/lightgbm_test.py
  - Test against sklearn GradientBoostingClassifier
  - Uses Iris dataset with boolean feature transformation

- Add emulation tests in emulations/ensemble/lightgbm_emul.py
  - Performance benchmarking with sklearn
  - SPU execution comparison

- Update ensemble __init__.py to export LightGBMClassifier
@gemini-code-assist (Contributor)

Summary of Changes

Hello @YuzeHao2023, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new LightGBM classifier implementation, expanding the machine learning capabilities of the sml library. It provides a robust, configurable gradient boosting model, complete with thorough testing and emulation to ensure its correctness and performance, particularly within secure computation environments.

Highlights

  • LightGBM Classifier Implementation: A new LightGBMClassifier has been implemented in sml/ensemble/lightgbm.py, supporting gradient boosting with decision trees and configurable parameters like n_estimators, learning_rate, max_depth, num_leaves, criterion, and epsilon.
  • Comprehensive Testing: Dedicated unit tests have been added in tests/ensemble/lightgbm_test.py to validate the LightGBMClassifier against scikit-learn's GradientBoostingClassifier using the Iris dataset.
  • Emulation Testing: Emulation tests are introduced in emulations/ensemble/lightgbm_emul.py to benchmark performance against scikit-learn and compare SPU execution, ensuring compatibility and efficiency in a secure computation environment.
  • Module Export: The new LightGBMClassifier is now exported from sml/ensemble/__init__.py, making it accessible within the sml.ensemble module.


@gemini-code-assist bot left a comment
Code Review

This pull request introduces a new LightGBMClassifier, along with corresponding tests and emulation scripts. My review focuses on the correctness of the implementation, its maintainability, and the quality of the tests. The main issue is that the classifier's name LightGBMClassifier is misleading, as the implementation appears to be a variant of AdaBoost (SAMME.R) rather than the LightGBM algorithm. I've also identified an unused parameter in the classifier's constructor and noted that the tests lack assertions to verify correctness. Additionally, there's some duplicated code between the test and emulation files that could be refactored. Please see my detailed comments for suggestions on how to address these points.

```python
from sml.tree.tree import DecisionTreeClassifier as sml_dtc


class LightGBMClassifier:
```
critical

The class is named LightGBMClassifier, but the implementation follows the AdaBoost.SAMME.R algorithm, not LightGBM or even standard gradient boosting. The weight update rule in _boost_round (lines 220-226) is characteristic of AdaBoost. This name is misleading and can cause confusion for users. Please rename the class to something that accurately reflects the algorithm, such as SAMMEClassifier.
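For reference, the AdaBoost-family update the review is pointing at looks roughly like the following. This is an illustrative numpy sketch of one discrete SAMME round, not the PR's actual code; the function name and signature are assumptions:

```python
import numpy as np

def samme_round(sample_weight, y_true, y_pred, n_classes):
    """One discrete SAMME boosting round (illustrative sketch).

    Reweights samples by misclassification -- the AdaBoost-family
    update, as opposed to fitting each new tree to the gradient of
    the loss the way LightGBM-style gradient boosting does.
    """
    incorrect = (y_pred != y_true).astype(float)
    err = np.average(incorrect, weights=sample_weight)
    # estimator weight grows as the weighted error shrinks
    alpha = np.log((1.0 - err) / err) + np.log(n_classes - 1.0)
    # misclassified samples get exponentially larger weights
    sample_weight = sample_weight * np.exp(alpha * incorrect)
    return sample_weight / sample_weight.sum(), alpha
```

A gradient-boosting round, by contrast, keeps sample weights fixed and fits each new tree to the negative gradient of the loss, which is why the class name matters.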

Comment on lines +38 to +40
```python
num_leaves : int
    The maximum number of leaves in each tree.
    Must be greater than 1. Default is 31.
```
high

The num_leaves parameter is defined in the docstring and accepted in __init__, but it is not used anywhere in the implementation. The DecisionTreeClassifier is instantiated using max_depth only. This can be misleading for users of the class. Please remove the num_leaves parameter from the docstring, the __init__ method signature, its validation, and its assignment to self.num_leaves.

Comment on lines +94 to +95
```python
print(f"Accuracy in SKlearn: {score_plain}")
print(f"Accuracy in SPU: {score_encrypted}")
```
high

The test calculates accuracy scores but only prints them. A unit test must contain assertions to automatically verify the correctness of the implementation. Please add an assertion to check if the score from the SPU implementation is close to the score from the sklearn implementation. For example: assert jnp.isclose(score_plain, score_encrypted, atol=0.05).
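Concretely, the end of the test could assert on the two scores instead of only printing them. A sketch with stand-in values (numpy is used here so the snippet is self-contained; the test file itself would use jnp.isclose on the real scores):

```python
import numpy as np

# stand-in scores for illustration; the real test would use the
# accuracies computed from sklearn and the SPU simulator
score_plain = 0.96
score_encrypted = 0.94

# fail the test automatically if SPU accuracy drifts from sklearn's
assert np.isclose(score_plain, score_encrypted, atol=0.05), (
    f"SPU accuracy {score_encrypted} deviates from sklearn {score_plain}"
)
```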

Comment on lines +51 to +64
```python
def load_data():
    iris = load_iris()
    iris_data, iris_label = jnp.array(iris.data), jnp.array(iris.target)
    # sorted_features: n_samples * n_features_in
    n_samples, n_features_in = iris_data.shape
    sorted_features = jnp.sort(iris_data, axis=0)
    new_threshold = (sorted_features[:-1, :] + sorted_features[1:, :]) / 2
    new_features = jnp.greater_equal(
        iris_data[:, :], new_threshold[:, jnp.newaxis, :]
    )
    new_features = new_features.transpose([1, 0, 2]).reshape(n_samples, -1)

    X, y = new_features[:, ::3], iris_label[:]
    return X, y
```
medium

This load_data function is identical to the one in tests/ensemble/lightgbm_test.py. To avoid code duplication and improve maintainability, consider extracting this function into a shared utility module and importing it in both files.
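One way the refactor could look, as a sketch; the module path and function name below are assumptions, not part of the PR:

```python
# e.g. a shared helper module (hypothetical name and location)
import numpy as np
from sklearn.datasets import load_iris

def load_binarized_iris(stride=3):
    """Binarize Iris features against midpoints of consecutive sorted values.

    Mirrors the duplicated load_data() logic from the test and
    emulation files, written with numpy for portability.
    """
    iris = load_iris()
    data, label = np.asarray(iris.data), np.asarray(iris.target)
    n_samples, _ = data.shape
    sorted_features = np.sort(data, axis=0)
    thresholds = (sorted_features[:-1, :] + sorted_features[1:, :]) / 2
    # compare every sample against every candidate threshold per feature
    features = np.greater_equal(data[:, :], thresholds[:, np.newaxis, :])
    features = features.transpose([1, 0, 2]).reshape(n_samples, -1)
    return features[:, ::stride], label
```

Both tests/ensemble/lightgbm_test.py and emulations/ensemble/lightgbm_emul.py could then import this one function, keeping a single source of truth.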

```python
# limitations under the License.


import copy
```
medium

The copy module is imported but is not used within the file. Please remove this unused import.

Comment on lines +268 to +281
```python
pred = sum(
    jnp.where(
        (estimator.predict(X) == classes).T,
        w,
        -1 / (n_classes - 1) * w,
    )
    * flag
    for estimator, w, flag in zip(
        self.estimators_,
        self.estimator_weight_,
        self.estimator_flags_,
        strict=True,
    )
)
```
medium

The decision_function uses a Python sum() on a generator expression. This pattern is not compatible with JAX's JIT compilation and can be inefficient for a large number of estimators due to Python-level iteration. For better performance and to enable JIT compilation, consider refactoring this to use jax.lax.fori_loop to accumulate the predictions.
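A hedged sketch of what that refactor might look like: precompute the per-estimator vote matrices once (the estimator.predict calls are Python-level and stay outside the traced loop), then accumulate them with jax.lax.fori_loop. Names here are illustrative, not the PR's code:

```python
import jax
import jax.numpy as jnp

def accumulate_votes(votes, weights, flags):
    """Sum weighted per-estimator vote matrices in a JIT-friendly loop.

    votes:   (n_estimators, n_samples, n_classes); votes[i] holds the
             +1 / -1/(n_classes-1) matrix for estimator i (assumed
             precomputed from estimator.predict outside the loop)
    weights: (n_estimators,) estimator weights
    flags:   (n_estimators,) 0/1 estimator flags
    """
    def body(i, acc):
        # weights[i] and flags[i] broadcast over the vote matrix
        return acc + votes[i] * weights[i] * flags[i]

    init = jnp.zeros(votes.shape[1:])
    return jax.lax.fori_loop(0, votes.shape[0], body, init)
```

Since the weights and flags are per-estimator scalars, the same result can also be written without an explicit loop as `(votes * (weights * flags)[:, None, None]).sum(axis=0)`, which is often the simpler JIT-compatible form.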

Comment on lines +50 to +63
```python
def load_data():
    iris = load_iris()
    iris_data, iris_label = jnp.array(iris.data), jnp.array(iris.target)
    # sorted_features: n_samples * n_features_in
    n_samples, n_features_in = iris_data.shape
    sorted_features = jnp.sort(iris_data, axis=0)
    new_threshold = (sorted_features[:-1, :] + sorted_features[1:, :]) / 2
    new_features = jnp.greater_equal(
        iris_data[:, :], new_threshold[:, jnp.newaxis, :]
    )
    new_features = new_features.transpose([1, 0, 2]).reshape(n_samples, -1)

    X, y = new_features[:, ::3], iris_label[:]
    return X, y
```
medium

This load_data function is identical to the one in emulations/ensemble/lightgbm_emul.py. To avoid code duplication and improve maintainability, please extract this function into a shared utility module and import it in both files.

```python
sim = spsim.Simulator.simple(3, libspu.ProtocolKind.ABY3, libspu.FieldType.FM64)

X, y = load_data()
n_samples, n_features = X.shape
```
medium

The variables n_samples and n_features are assigned but never used. They should be removed to improve code clarity.
