Releases: oracle/tribuo

Tribuo v4.3.1

23 Dec 19:58

Small patch release to bump some dependencies and pull in minor fixes. The most notable fix allows CART trees to generate pure nodes, which they were previously prevented from doing. This will likely improve classification tree performance, both for single trees and when they are used in an ensemble such as RandomForests.

  • FeatureHasher now has an option to not hash the values, and TokenPipeline defaults to not hashing the values (#309).
  • Improving the documentation for loading multi-label data with CSVLoader (#306).
  • Allows Example.densify to add arbitrary features (#304).
  • Adds accessors to ClassifierChainModel and IndependentMultiLabelModel so the individual models can be accessed (#302).
  • Allows CART trees to create pure leaves (#303).
  • Bumping jackson-core to 2.14.1, jackson-databind to 2.14.1, OpenCSV to 5.7.1 (pulling in the fixed commons-text 1.10.0).

Tribuo v4.2.2

25 Oct 15:56

Small patch release to bump some dependencies and pull in minor fixes:

  • Validate hash salt during object creation (#237).
  • Fix XGBoost parameter overriding (#239).
  • Add some necessary accessors to TransformedModel (#244).
  • Bumping TF-Java to v0.4.2 (#281).
  • Fixes for test failures when running from a path containing spaces (#287).
  • Fix documentation links to the OCA.
  • Bumping jackson-core to 2.13.4, jackson-databind to 2.13.4.2, protobuf-java to 3.19.6, OpenCSV to 5.7.1 (pulling in the fixed commons-text 1.10.0).

Tribuo v4.3.0

07 Oct 18:32
b8ba451

Tribuo v4.3 adds feature selection for classification problems, support for guided generation of model cards, and protobuf serialization for all serializable classes. In addition there is a new interface for distance based computations which can now use a kd-tree or brute force comparisons, the sparse linear model package has been rewritten to use Tribuo's linear algebra system improving the speed and reducing memory consumption, and we've added some more tutorials.

Note this is likely the last feature release of Tribuo to support Java 8. The next major version of Tribuo will require Java 17. In addition, support for using java.io.Serializable for serialization will be removed in the next major release, and Tribuo will exclusively use protobuf based serialization.

Feature Selection

In this release we've added support for feature selection algorithms to the dataset and provenance systems, along with implementations of 4 information theoretic feature selection algorithms for use in classification problems. The algorithms (MIM, CMIM, mRMR and JMI) are described in this review paper. Continuous inputs are discretised into a fixed number of equal width bins before the mutual information is computed. These algorithms are a useful feature selection baseline, and we welcome contributions to extend the set of supported algorithms.

  • Feature selection algorithms #254.
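The binning and scoring steps described above can be sketched in plain Java. This is a minimal illustration of equal-width discretisation followed by a plug-in mutual information estimate (the quantity MIM ranks features by), not Tribuo's implementation; the class and method names are hypothetical and the bin count is a caller-supplied assumption.

```java
import java.util.Arrays;

// Hypothetical sketch: equal-width binning plus a mutual information estimate.
final class MimSketch {
    // Map each value to one of numBins equal-width bins spanning [min, max].
    static int[] equalWidthBins(double[] values, int numBins) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double width = (max - min) / numBins;
        int[] bins = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant feature has zero width, so every value falls in bin 0.
            int b = width == 0.0 ? 0 : (int) ((values[i] - min) / width);
            // Clamp so the maximum value lands in the last bin.
            bins[i] = Math.min(b, numBins - 1);
        }
        return bins;
    }

    // Plug-in estimate of I(X;Y) in nats from two aligned discrete arrays.
    static double mutualInformation(int[] x, int numX, int[] y, int numY) {
        double[][] joint = new double[numX][numY];
        double[] px = new double[numX];
        double[] py = new double[numY];
        double inc = 1.0 / x.length;
        for (int i = 0; i < x.length; i++) {
            joint[x[i]][y[i]] += inc;
            px[x[i]] += inc;
            py[y[i]] += inc;
        }
        double mi = 0.0;
        for (int a = 0; a < numX; a++) {
            for (int b = 0; b < numY; b++) {
                if (joint[a][b] > 0.0) {
                    mi += joint[a][b] * Math.log(joint[a][b] / (px[a] * py[b]));
                }
            }
        }
        return mi;
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 1, 1};
        double[] informative = {0.1, 0.2, 0.9, 1.0};
        double[] noise = {0.1, 1.0, 0.2, 0.9};
        // MIM ranks features by this score, highest first.
        System.out.println(mutualInformation(equalWidthBins(informative, 2), 2, labels, 2));
        System.out.println(mutualInformation(equalWidthBins(noise, 2), 2, labels, 2));
    }
}
```

MIM scores each feature independently; CMIM, mRMR and JMI additionally account for the features already selected, which this sketch does not attempt.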

Model Card Support

Model Cards are a popular way of describing a model, its training data, expected applications and any use cases that should be avoided. In this release we've added guided generation of model cards, where many fields are automatically generated from the provenance information inside each Tribuo model. Fields which require user input (such as the expected use cases for a model, or its license) can be added via a CLI program, and the resulting model card can be saved in json format.

At the moment, the automatic data extraction fails on some kinds of nested ensemble models which are generated without using a Tribuo Trainer class. In the future we'll look at improving the data extraction for this case.

Protobuf Serialization

In this release we've added protocol buffer definitions for serializing all of Tribuo's serializable types, along with the necessary code to interact with those definitions. This effort has improved the validation of serialized data, and will allow Tribuo models to be upwards compatible across major versions of Tribuo. Any serialized model or dataset from Tribuo v4.2 or earlier can be loaded in and saved out into the new format which will ensure compatibility with the next major version of Tribuo.

  • Protobuf support for core types (#226, #255, #262, #264).
  • Protobuf support for models (Multinomial Naive Bayes #267, Sparse linear models #269, XGBoost #270, OCI, ONNX and TF #271, LibSVM #272, LibLinear #273, SGD #275, Clustering models #276, Baseline models and ensembles #277, Trees #278).
  • Docs and supporting programs (#279).

Smaller improvements

We added an interface for querying the nearest neighbours of a vector, and updated HDBSCAN, K-Means and K-NN to use the new interface. The old implementation has been renamed the "brute force" search operator, and a new implementation which uses a kd-tree has been added.
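As a rough illustration of what a "brute force" neighbour query does, the sketch below computes the distance from the query vector to every stored point and keeps the k closest. It is not Tribuo's interface: the class and method names are hypothetical, and squared Euclidean distance is an assumption made for the example. A kd-tree returns the same neighbours while skipping regions of space that cannot contain a closer point.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of a brute-force k-nearest-neighbour query.
final class BruteForceNeighbours {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the indices of the k points closest to target, nearest first.
    static int[] nearest(double[][] points, double[] target, int k) {
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Every point is compared with the target, hence "brute force".
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer i) -> squaredDistance(points[i], target)));
        int n = Math.min(k, points.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = order[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {1, 1}, {5, 5}, {0.5, 0}};
        System.out.println(Arrays.toString(nearest(points, new double[]{0, 0}, 2)));
    }
}
```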

We migrated off Apache Commons Math, which necessitated adding several methods to Tribuo's math library. In the process we refactored the sparse linear model code, removing redundant matrix operations and greatly improving the speed of LASSO.

  • Refactor sparse linear models and remove Apache Commons Math (#241).

The ONNX export support has been refactored to allow the use of different ONNX opsets, and custom ONNX operations. This allows users of Tribuo's ONNX export support to supply their own operations, and increases the flexibility of the ONNX support on the JVM.

  • ONNX operator refactor (#245).

ONNX Runtime has been upgraded to v1.12.1, which includes Linux ARM64 and macOS ARM64 binaries. As a result we've removed the ONNX tests from the arm Maven profile, and so those tests will execute on Linux & macOS ARM64 platforms.

  • ONNX Runtime upgrade (#256).

Small improvements

  • Improved the assignment to the noise cluster in HDBSCAN (#222).
  • Upgrade liblinear-java to v2.44 (#228).
  • Added accessors for the HDBSCAN cluster exemplars (#229).
  • Improve validation of salts when hashing feature names (#237).
  • Added accessors to TransformedModel for the wrapped model (#244).
  • Added a regex text preprocessor (#247).
  • Upgrade OpenCSV to v5.6 (#259).
  • Added a builder to RowProcessor to make it less confusing (#263).
  • Upgrade TF-Java to v0.4.2 (#281).
  • Upgrade OCI Java SDK to v2.46.0, protobuf-java to 3.19.6, XGBoost to 1.6.2, jackson to 2.14.0-rc1 (#288).

Bug Fixes

  • Fix for HDBSCAN small cluster generation (#236).
  • XGBoost provenance capture (#239).

Tribuo v4.2.1

04 May 16:19

Small patch release for three issues:

  • Ensure K-Means thread pools shut down when training completes (#224)
  • Fix issues where ONNX export of ensembles, K-Means initialization and several tests relied upon HashSet iteration order (#220, #225)
  • Upgrade to TF-Java 0.4.1 which includes an upgrade to TF 2.7.1 which brings in several fixes for native crashes operating on malformed or malicious models (#228)
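The HashSet issue in the list above is a general Java pitfall: HashSet makes no guarantee about iteration order, so any result that depends on that order can differ between runs or JVMs. The sketch below shows two common ways to get a stable order. It is a generic illustration with hypothetical names, not the change made in those pull requests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: obtaining a stable iteration order from set-like data.
final class DeterministicOrder {
    // Sort the elements so the order does not depend on the Set implementation.
    static List<String> sortedOrder(Set<String> names) {
        List<String> out = new ArrayList<>(names);
        Collections.sort(out);
        return out;
    }

    // Deduplicate while keeping first-seen order; LinkedHashSet iterates in insertion order.
    static List<String> insertionOrder(List<String> names) {
        return new ArrayList<>(new LinkedHashSet<>(names));
    }

    public static void main(String[] args) {
        System.out.println(sortedOrder(Set.copyOf(List.of("b", "c", "a"))));
        System.out.println(insertionOrder(List.of("b", "a", "b", "c")));
    }
}
```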

OLCUT is updated to 5.2.1 to pull in updated versions of jackson & protobuf (#234). Also includes some docs and a small update for K-Means' toString (#209, #211, #212).

Tribuo v4.2.0

20 Dec 22:09
1c594dc

Tribuo 4.2 adds new models, ONNX export for several types of models, a reproducibility framework for recreating Tribuo models, easy deployment of Tribuo models on Oracle Cloud, along with several smaller improvements and bug fixes. We've added more tutorials covering the new features along with multi-label classification, and further expanded the javadoc to cover all public methods.

In Tribuo 4.1.0 and earlier there is a severe bug in multi-dimensional regression models (i.e., regression tasks with multiple output dimensions). Models other than LinearSGDModel and SparseLinearModel (apart from when using the ElasticNetCDTrainer) have a bug in how the output dimension indices are constructed, and may produce incorrect outputs for all dimensions (as the output will be for a different dimension than the one named in the Regressor object). This has been fixed, and loading in models trained in earlier versions of Tribuo will patch the model to rearrange the dimensions appropriately. Unfortunately this fix cannot be applied to tree based models, and so all multi-output regression tree based models should be retrained using Tribuo 4.2 as they are irretrievably corrupt. Additionally, multi-output regression LibSVM models trained with standardization store the model improperly for every dimension past the first, and will also need to be retrained with Tribuo 4.2. See #177 for more details.

Note the KMeans implementation had several internal changes to support running with a java.lang.SecurityManager which will break any subclasses of KMeansTrainer. In most cases changing the signature of any overridden mStep method to match the new signature, and allowing the fjp argument to be null in single threaded execution will fix the subclass.

New models

In this release we've added Factorization Machines, Classifier Chains and HDBSCAN*. Factorization machines are a powerful non-linear predictor which uses a factorized approximation to learn a per output feature-feature interaction term in addition to a linear model. We've added Factorization Machines for multi-class classification, multi-label classification and regression. Classifier chains are an ensemble approach to multi-label classification which given a specific ordering of the labels learns a chain of classifiers where each classifier gets the features along with the predicted labels from earlier in the chain. We also added ensembles of randomly ordered classifier chains which work well in situations when the ground truth label ordering is unknown (i.e., most of the time). HDBSCAN is a hierarchical density based clustering algorithm which chooses the number of clusters based on properties of the data rather than as a hyperparameter. The Tribuo implementation can cluster a dataset, and then at prediction time it provides the cluster the given datapoint would be in without modifying the cluster structure.

  • Classifier Chains (#149), which also adds the jaccard score as a multi-label evaluation metric, and a multi-label voting combiner for use in multi-label ensembles.
  • Factorization machines (#179).
  • HDBSCAN (#196).
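To make the factorization machine description concrete, the sketch below computes the standard second-order score for a single output: a bias, a linear term, and a pairwise interaction term whose weights are inner products of per-feature factor vectors. It follows the textbook formulation rather than Tribuo's code; the class, method and parameter names are hypothetical, and a multi-output model would hold one such set of parameters per output.

```java
// Hypothetical sketch of a second-order factorization machine score for one output.
final class FactorizationMachineSketch {
    // score = bias + sum_i w[i]*x[i] + sum_{i<j} <v[i], v[j]> * x[i] * x[j]
    // The pairwise term uses the identity
    //   sum_{i<j} <v[i],v[j]> x[i] x[j]
    //     = 0.5 * sum_f ((sum_i v[i][f] x[i])^2 - sum_i (v[i][f] x[i])^2),
    // which costs O(k * n) instead of O(k * n^2).
    static double score(double bias, double[] w, double[][] v, double[] x) {
        double result = bias;
        for (int i = 0; i < x.length; i++) {
            result += w[i] * x[i];
        }
        int numFactors = v[0].length;
        for (int f = 0; f < numFactors; f++) {
            double sum = 0.0;
            double sumOfSquares = 0.0;
            for (int i = 0; i < x.length; i++) {
                double term = v[i][f] * x[i];
                sum += term;
                sumOfSquares += term * term;
            }
            result += 0.5 * (sum * sum - sumOfSquares);
        }
        return result;
    }

    // Direct evaluation of the same formula, useful for checking the fast version.
    static double naiveScore(double bias, double[] w, double[][] v, double[] x) {
        double result = bias;
        for (int i = 0; i < x.length; i++) {
            result += w[i] * x[i];
        }
        for (int i = 0; i < x.length; i++) {
            for (int j = i + 1; j < x.length; j++) {
                double dot = 0.0;
                for (int f = 0; f < v[i].length; f++) {
                    dot += v[i][f] * v[j][f];
                }
                result += dot * x[i] * x[j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] w = {1, 2, 3};
        double[][] v = {{1, 0}, {0, 1}, {1, 1}};
        double[] x = {1, 0, 2};
        System.out.println(score(0.5, w, v, x));
    }
}
```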

ONNX Export

The ONNX format is a cross-platform and cross-library model exchange format. Tribuo can already serve ONNX models via its ONNX Runtime interface, and now has the ability to export models in ONNX format for serving on edge devices, in cloud services, or in other languages like Python or C#.

In this release Tribuo supports exporting linear models (multi-class classification, multi-label classification and regression), sparse linear regression models, factorization machines (multi-class classification, multi-label classification and regression), LibLinear models (multi-class classification and regression), LibSVM models (multi-class classification and regression), along with ensembles of those models, including arbitrary levels of ensemble nesting. We plan to expand this coverage to more models over time, however for TensorFlow we recommend users export those models as a Saved Model and use the Python tf2onnx converter.

Tribuo models exported in ONNX format preserve their provenance information in a metadata field which is accessible when the ONNX model is loaded back into Tribuo. The provenance is stored as a protobuf so could be read from other libraries or platforms if necessary.

The ONNX export support is in a separate module with no dependencies, and could be used elsewhere on the JVM to support generating ONNX graphs. We welcome contributions to build out the ONNX support in that module.

  • ONNX export for LinearSGDModels (#154), which also adds a multi-label output transformer for scoring multi-label ONNX models.
  • ONNX export for SparseLinearModel (#163).
  • Add provenance to ONNX exported models (#182).
  • Refactor ONNX tensor creation (#187).
  • ONNX ensemble export support (#186).
  • ONNX export for LibSVM and LibLinear (#191).
  • Refactor ONNX support to improve type safety (#199).
  • Extract ONNX support into separate module (#TBD).

Reproducibility Framework

Tribuo has strong model metadata support via its provenance system which records how models, datasets and evaluations are created. In this release we enhance this support by adding a push-button reproduction framework which accepts either a model provenance or a model object and rebuilds the complete training pipeline, ensuring consistent usage of RNGs and other mutable state.

This allows Tribuo to easily rebuild models to see if updated datasets could change performance, or even if the model is actually reproducible (which may be required for regulatory reasons). Over time we hope to expand this support into a full experimental framework, allowing models to be rebuilt with hyperparameter or data changes as part of the data science process or for debugging models in production.

This framework was written by Joseph Wonsil and Prof. Margo Seltzer at the University of British Columbia as part of a collaboration between Prof. Seltzer and Oracle Labs. We're excited to continue working with Joe, Margo and the rest of the lab at UBC, as this is excellent work.

Note the reproducibility framework module requires Java 16 or greater, and is thus not included in the tribuo-all meta-module.

  • Reproducibility framework (#185, with minor changes in #189 and #190).

OCI Data Science Integration

Oracle Cloud Data Science is a platform for building and deploying models in Oracle Cloud. The model deployment functionality wraps models in a Python runtime and deploys them with an auto-scaler at a REST endpoint. In this release we've added support for deploying Tribuo models which are ONNX exportable directly to OCI DS, allowing scale-out deployments of models from the JVM. We also added an OCIModel wrapper which scores Tribuo Example objects using a deployed model's REST endpoint, allowing easy use of cloud resources for ML on the JVM.

  • Oracle Cloud Data Science integration (#200).

Small improvements

  • Date field processor and locale support in metadata extractors (#148)
  • Multi-output response processor allowing loading different formats of multi-label and multi-dimensional regression datasets (#150)
  • ARM dev profile for compiling Tribuo on ARM platforms (#152)
  • Refactor CSVLoader so it uses CSVDataSource and parses CSV files using RowProcessor, allowing an easy transition to more complex columnar extraction (#153)
  • Configurable anomaly demo data source (#160)
  • Configurable clustering demo data source (#161)
  • Configurable classification demo data source (#162)
  • Multi-label tutorial and configurable multi-label demo data source (#166), plus a fix in #168 after #167
  • Add javadoc for all public methods and fields (#175) (also fixes a bug in Util.vectorNorm)
  • Add hooks for model equality checks to trees and LibSVM models (#183) (also fixes a bug in liblinear get top features)
  • XGBoost 1.5.0 (#192)
  • TensorFlow Java 0.4.0 (#195) (note this changes Tribuo's TF API slightly as TF-Java 0.4.0 has a different method of initializing the session)
  • KMeans now uses dense vectors when appropriate, speeding up training (#201)
  • Documentation updates, ONNX and reproducibility tutorials ([#205](https:/...

Tribuo v4.1.1

10 Dec 18:22

This is the first patch release for Tribuo v4.1. The main fixes in this release are to the multi-dimensional output regression support, and to support the use of KMeans and KNN models when running under a restrictive SecurityManager. Additionally this release pulls in TensorFlow-Java 0.4.0 which upgrades the TensorFlow native library to 2.7.0 fixing several CVEs. Note those CVEs may not be applicable to TensorFlow-Java, as many of them relate to Python codepaths which are not included in TensorFlow-Java. Also note the TensorFlow upgrade is a breaking API change for Tribuo's TF API as graph initialization is handled differently in this release, which causes unavoidable changes.

Multi-dimensional Regression fix

In Tribuo 4.1.0 and earlier there is a severe bug in multi-dimensional regression models (i.e., regression tasks with multiple output dimensions). Models other than LinearSGDModel and SparseLinearModel (apart from when using the ElasticNetCDTrainer) have a bug in how the output dimension indices are constructed, and may produce incorrect outputs for all dimensions (as the output will be for a different dimension than the one named in the Regressor object). This has been fixed, and loading in models trained in earlier versions of Tribuo will patch the model to rearrange the dimensions appropriately. Unfortunately this fix cannot be applied to tree based models, and so all multi-output regression tree based models should be retrained using Tribuo 4.1.1 or newer as they are irretrievably corrupt. Additionally, multi-output regression LibSVM models trained with standardization store the model improperly for every dimension past the first, and will also need to be retrained with Tribuo 4.1.1 or newer. See #177 for more details.

Bug fixes

  • NPE fix for LIME explanations using models which don't support per class weights (#157).
  • Fixing a bug in multi-label evaluation which swapped FP for FN (#167).
  • Fixing LibSVM and LibLinear so they have reproducible behaviour (#172).
  • Provenance fix for TransformTrainer and an extra factory for XGBoostExternalModel so you can make them from an in memory booster (#176)
  • Fix multidimensional regression (#177) (fixes regression ids, fixes libsvm so it emits correct standardized models, adds support for per dimension feature weights in XGBoostRegressionModel).
  • Normalize LibSVMDataSource paths consistently in the provenance (#181).
  • KMeans and KNN now run correctly when using OpenSearch's SecurityManager (#197).
  • TensorFlow-Java 0.4.0 (#195).

Full Changelog: v4.1.0...v4.1.1

Tribuo v4.1.0

26 May 18:40

Tribuo v4.1 Release Notes

Tribuo 4.1 is the first feature release after the initial open source release. We've added new models, new parameters for some models, improvements to data loading, documentation, transformations and the speed of our CRF and linear models, along with a large update to the TensorFlow interface. We've also revised the tutorials and added two new ones covering TensorFlow and document classification.

TensorFlow support

Migrated to TensorFlow Java 0.3.1 which allows specification and training of models in Java (#134). The TensorFlow models can be saved in two formats, either using TensorFlow's checkpoint format or Tribuo's native model serialization. They can also be exported as TensorFlow Saved Models for interop with other TensorFlow platforms. Tribuo can now load TF v2 Saved Models and serve them alongside TF v1 frozen graphs with its external model loader.

We also added a TensorFlow tutorial which walks through the creation of a simple regression MLP, a classification MLP and a classification CNN, before exporting the model as a TensorFlow Saved Model and importing it back into Tribuo.

New models

  • Added extremely randomized trees, i.e., ExtraTrees (#51).
  • Added an SGD based linear model for multi-label classification (#106).
  • Added liblinear's linear SVM anomaly detector (#114).
  • Added arbitrary ensemble creation from existing models (#129).

New features

  • Added K-Means++ (#34).
  • Added XGBoost feature importance metrics (#52).
  • Added OffsetDateTimeExtractor to the columnar data package (#66).
  • Added an empty response processor for use with clustering datasets (#99).
  • Added IDFTransformation for generating TF-IDF features (#104).
  • Exposed more parameters for XGBoost models (#107).
  • Added a Wordpiece tokenizer (#111).
  • Added optional output standardisation to LibSVM regressors (#113).
  • Added a BERT feature extractor for text data (#116).
    This can load in ONNX format BERT (and BERT style) models from HuggingFace Transformers, and use them as part of Tribuo's text feature extraction package.
  • Added a configurable version of AggregateDataSource, and added iteration order parameters to both forms of AggregateDataSource (#125).
  • Added an option to RowProcessor which passes through newlines (#137).
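Of the features listed above, the TF-IDF weighting that IDFTransformation targets is easy to sketch in isolation. The example below uses one common textbook variant, raw term count multiplied by ln(N / document frequency); which variant Tribuo applies is not stated in these notes, so treat the formula and all names here as assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of TF-IDF using idf(t) = ln(N / df(t)) and raw term counts.
final class TfIdfSketch {
    // Compute the inverse document frequency of every term seen in the corpus.
    static Map<String, Double> inverseDocumentFrequency(List<List<String>> docs) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term at most once per document.
            for (String term : new HashSet<>(doc)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : documentFrequency.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    // Weight a document: each occurrence of a known term adds that term's idf,
    // so the final value is count * idf. Terms unseen in the corpus are skipped.
    static Map<String, Double> tfIdf(List<String> doc, Map<String, Double> idf) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : doc) {
            Double weight = idf.get(term);
            if (weight != null) {
                weights.merge(term, weight, Double::sum);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("a", "b"), List.of("a", "c"), List.of("a", "a"));
        Map<String, Double> idf = inverseDocumentFrequency(corpus);
        System.out.println(tfIdf(List.of("b", "b", "a"), idf));
    }
}
```

A term that appears in every document receives an idf of zero under this variant, which is why smoothed forms are often preferred in practice.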

Other improvements

  • Removed redundant computation in tree construction (#63).
  • Added better accessors for the centroids of a K-Means model (#98).
  • Improved the speed of the feature transformation infrastructure (#104).
  • Refactored the SGD models to reduce redundant code and allow models to share upcoming improvements (#106, #134).
  • Added many performance optimisations to the linear SGD and CRF models, allowing the automatic use of dense feature spaces (#112). This also adds specialisations to the math library for dense vectors and matrices, improving the performance of the CRF model even when operating on sparse feature sets.
  • Added provenance tracking of the Java version, OS and CPU architecture (#115).
  • Changed the behaviour of sparse features under transformations to expose additional behaviour (#122).
  • Improved MultiLabelEvaluation.toString() (#136).
  • Added a document classification tutorial which shows the various text feature extraction techniques available in Tribuo.
  • Expanded javadoc coverage.
  • Upgraded ONNX Runtime to 1.7.0, XGBoost to 1.4.1, TensorFlow to 0.3.1, liblinear-java to 2.43, OLCUT to 5.1.6, OpenCSV to 5.4.
  • Miscellaneous small bug fixes.

Tribuo v4.0.2

05 Nov 17:45

This is the first Tribuo point release after the initial public announcement. It fixes many of the issues our early users have found, and improves the documentation in the areas flagged by those users. We also added a couple of small new methods as part of fixing the bugs, and added two new tutorials: one on columnar data loading and one on external model loading (i.e. XGBoost and ONNX models).

Bugs fixed:

  • Fixed a locale issue in the evaluation tests.
  • Fixed issues with RowProcessor (expand regexes not being called, improper provenance capture).
  • IDXDataSource now throws FileNotFoundException rather than a mysterious NullPointerException when it can't find the file.
  • Fixed issues in JsonDataSource (consistent exceptions thrown, proper termination of reading in several cases).
  • Fixed an issue where regression models couldn't be serialized due to a non-serializable lambda.
  • Fixed UTF-8 BOM issues in CSV loading.
  • Fixed an issue where LibSVMTrainer didn't track state between repeated calls to train.
  • Fixed issues in the evaluators to ensure consistent exception throwing when discovering unlabelled or unknown ground truth outputs.
  • Fixed a bug in ONNX LabelTransformer where it wouldn't read pytorch outputs properly.
  • Bumped to OLCUT 5.1.5 to fix a provenance -> configuration conversion issue.
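For context on the UTF-8 BOM item in the list above: Java's UTF-8 decoder does not drop a leading byte order mark, so the first header or field of a CSV file can silently gain an invisible U+FEFF character. The sketch below shows the general technique of stripping it after decoding. It is a generic illustration with hypothetical names, not the fix applied in Tribuo.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: removing a leading UTF-8 byte order mark from decoded text.
final class BomSketch {
    static final char BOM = '\uFEFF';

    // Return the text without a leading BOM, or unchanged if none is present.
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // The bytes EF BB BF are the UTF-8 encoding of U+FEFF.
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("id"));           // the BOM is still attached
        System.out.println(stripBom(decoded).equals("id")); // the BOM has been removed
    }
}
```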

New additions:

  • Added a method which converts a Jackson ObjectNode into a Map suitable for the RowProcessor.
  • Added missing serialization tests to all the models.
  • Added a getInnerModels method to LibSVMModel, LibLinearModel and XGBoostModel to allow users to access a copy of the internal models.
  • More documentation.
  • Columnar data loading tutorial.
  • External model (XGBoost & ONNX) tutorial.

Dependency updates:

  • OLCUT 5.1.5 (brings in jline 3.16.0 and jackson 2.11.3).

Tribuo v4.0.1

01 Sep 00:45
  • Fixes an issue where the CSVReader wouldn't read files with extraneous newlines at the end.
  • Adds an IDXDataSource so we can read IDX (i.e. MNIST) formatted datasets.
  • Updated the configuration tutorial to read MNIST from IDX files rather than libsvm files.

Tribuo v4.0.0 (Initial Public Release)

13 Aug 15:59

This is the first public release of the Tribuo Java Machine Learning library. Tribuo provides classification, regression, clustering and anomaly detection algorithms along with data loading, transformation and model evaluation code. Tribuo also provides support for loading external ONNX models and scoring them in Java as well as support for training and evaluating deep learning models using TensorFlow.

Tribuo's development started in 2016 led by Oracle Labs' Machine Learning Research Group, and has been in production inside Oracle since 2017. It's now available under an Apache 2.0 license, and we'll continue to develop it in the open, including accepting community PRs under the Oracle Contributor Agreement.

See tribuo.org for a project overview, or explore the docs here on Github for more details. We have jupyter notebook based tutorials demonstrating various features of the library.