Releases: ludwig-ai/ludwig
v0.4: Distributed processing and training with Ray and Dask, Distributed hyperopt with RayTune, TabNet, Remote FS, MLflow for monitoring and serving, new Datasets
Changelog
Additions
- Integrate ray tune into hyperopt (#1001)
- Added Ames Housing Kaggle dataset (#1098)
- Added functionality to obtain subtrees in the SST dataset (#1108)
- Added comparator combiner (#1113)
- Additional Text Classification Datasets (#1121)
- Added Ray remote backend and Dask distributed preprocessing (#1090)
- Added TabNet combiner and needed modules (#1062)
- Added Higgs Boson dataset (#1157)
- Added GitHub workflow to push to Docker Hub (#1160)
- Added more tagging schemes for Docker images (#1161)
- Added Docker build matrix (#1162)
- Added category feature > 1 dim to TabNet (#1150)
- Added timeseries datasets (#1149)
- Add TabNet Datasets (#1153)
- Forest Cover Type, Adult Census Income and Rossmann Store Sales datasets (#1165)
- Added KDD Cup 2009 datasets (#1167)
- Added Ray GPU image (#1170)
- Added support for cloud object storage (S3, GCS, ADLS, etc.) (#1164)
- Perform inference with Dask when using the Ray backend (#1128)
- Added schema validation to config files (#1186)
- Added MLflow experiment tracking support (#1191)
- Added export to MLflow pyfunc model format (#1192)
- Added MLP-Mixer image encoder (#1178)
- Added TransformerCombiner (#1177)
- Added TFRecord support as a preprocessing cache format (#1194)
- Added higgs boson tabnet examples (#1209)
Improvements
- Abstracted Horovod params into the Backend API (#1080)
- Added allowed_origins to serving to allow cross-origin requests (#1091)
- Added callbacks to hook into the training loop programmatically (#1094)
- Added scheduler support to Ray Tune hyperopt and fixed GPU usage (#1088)
- Ray Tune: enforced that epochs equals max_t and early stopping is disabled (#1109)
- Added register_trainable logic to RayTuneExecutor (#1117)
- Replaced Travis CI with GitHub Actions (#1120)
- Split distributed tests into separate test suite (#1126)
- Removed unused regularizer parameter from training defaults
- Restricted the Docker build GitHub Action to ludwig-ai repos only (#1166)
- Harmonize return object for categorical, sequence generator and sequence tagger (#1171)
- Sourcing images from either file path or in-memory ndarrays (#1174)
- Refactored hyperopt results into object structure for easier programmatic usage (#1184)
- Refactored all contrib classes to use the Callback interface (#1187)
- Improved performance of Dask preprocessing by adding parallelism (#1193)
- Improved TabNetCombiner and Concat combiner (#1177)
- Added additional backend configuration options (#1195)
- Made should_shuffle configurable in Trainer (#1198)
Bugfixes
- Fix SST parentheses issue
- Fix serve.py adding a try around the form parsing (#1111)
- Fix #1104: add lengths to text encoder output with updated unit test (#1105)
- Fix sst2 subtree logic to match glue sst2 dataset (#1112)
- Fix #1078: Avoid recreating cache when using image preproc (#1114)
- Fix checking if dask exists in figure_data_format_dataset
- Fixed bug in EthosBinary dataset class and model directory copying logic in RayTuneReportCallback (#1129)
- Fix #1070: error when saving model with image feature (#1119)
- Fixed IterableBatcher incompatibility with ParquetDataset and remote model serialization (#1138)
- Fix: passing backend and TF config parameters to model load path in experiment
- Fix: improved TabNet numerical stability + refactoring
- Fix #1147: passing bn_epsilon to AttentiveTransformer initialization in TabNet
- Fix #1093: loss value mismatch (#1103)
- Fixed CacheManager to correctly handle test_set and validation_set (#1189)
- Fixing TabNet sparsity loss issue (#1199)
Breaking changes
Most models trained with v0.3.3 should keep working in v0.4.
The main changes in v0.4 are additional options, so what worked previously should not be broken now.
One exception is that the validity of the model configuration is now checked much more strictly.
This allows errors to be caught earlier, although configurations that worked in the past despite containing errors may no longer work.
The checks should help identify the issues in those configurations, though, so the errors should be easily fixable.
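To illustrate the kind of error a stricter configuration check can surface, here is a minimal sketch in plain Python. It is not Ludwig's validation code: the `validate_config` helper, the allowed key sets, and the example config are all hypothetical and heavily simplified. It only shows how a misspelled section name, which a lenient loader might silently ignore, can be reported up front.

```python
# Hypothetical, simplified validator: NOT Ludwig's actual schema or API.
ALLOWED_TOP_LEVEL = {"input_features", "output_features", "combiner",
                     "training", "preprocessing", "hyperopt"}
REQUIRED_FEATURE_KEYS = {"name", "type"}


def validate_config(config):
    """Return a list of human-readable problems found in a config dict."""
    errors = []
    # Unknown sections are reported instead of being silently ignored.
    for key in config:
        if key not in ALLOWED_TOP_LEVEL:
            errors.append(f"unknown top-level section: {key!r}")
    # Every feature needs at least a name and a type.
    for section in ("input_features", "output_features"):
        features = config.get(section)
        if not features:
            errors.append(f"{section} must be a non-empty list")
            continue
        for i, feature in enumerate(features):
            missing = REQUIRED_FEATURE_KEYS - set(feature)
            for key in sorted(missing):
                errors.append(f"{section}[{i}] is missing {key!r}")
    return errors


# A config with a misspelled section ("trainning") and a feature without a type.
config = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment"}],
    "trainning": {"epochs": 10},
}
problems = validate_config(config)
for problem in problems:
    print(problem)
```

Running the sketch reports both problems at once, which is the practical benefit described above: the configuration fails early with a message that points at the offending key.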
Contributors
@tgaddair @jimthompson5802 @ANarayan @kaushikb11 @mejackreed @ronaldyang @zhisbug @nimz @kanishk16
v0.3.3: New datasets, dependency fixes
v0.3.2: New datasets, better processing of binary and numerical, minor fixes
Changelog
Additions
- Added feature identification logic (#957)
- Added Backend interface for abstracting DataFrame preprocessing steps (#1014)
- Add support for transforming numeric predictions that were normalized (#1015)
- Added Kaggle API integration and Titanic dataset (#1021)
- Add Korean translation for the README (#1022)
- Added cast_columns function to preprocessing and cast_column function to all feature mixin classes (#1027)
- Added custom encoder / decoder registration decorator (#1017)
- Add titles to Hyperopt Report visualization (#1026)
- Added label-wise probability to binary feature predictions (#1033)
- Add support for num_layers in sequence generator decoder (#1050)
- Added Flickr8k dataset (#1053)
Improvements
- Improved triggering of cache re-creation (now it depends also on changes in feature types)
- Improved legend and add tight_layout param to compare predictions plot (#1037)
- Improved postprocessing for binary features so prediction vocab matches inputs (#1038)
- Bump TensorFlow and tfa-nightly for 2.4.0 release (#1058)
- Updated Dockerfiles to TensorFlow 2.4.0 (#1059)
Bugfixes
- Fix missing yaml files for datasets in pip package
- Fix hdf5 preprocessing error
- Fix calculation of the metric score for hyperopt (#1031)
- Fix wrong argument in visualize.py from -f to -ofn (#1032)
- Fix fill NaN by adding selected conversion of columns to string when computing metadata (#1042)
- Fix: inconsistent seq length for probabilities (#1043)
- Fix issues with changes in xlrd package (#1056)
v0.3.1: Datasets, cache checksum, improvements for text and visualization
Additions
- Added dataset module (#949) containing MNIST, SST-2, SST-5, REUTERS, OHSUMED, FEVER and GoEmotions datasets
- Add Ludwig Model Serve Example (#947)
- Add checksum mechanism for HDF5 and Meta JSON cache file (#1006)
Improvements
- Updated run_experiment to use new skip parameters and returns (#955)
- Several improvements to testing (more coverage, with faster tests)
- Changed default value of HF encoder trainable parameter to True (for performance reasons) (#996)
- Improved and slightly modified visualization functions API
Bugfixes
- Changed not to is None in dataset checks in hyperopt.run.hyperopt() (#956)
- Fix LudwigModel.predict() when skip_save_predictions = False (#962)
- Fix #963: Convert materialized tensors to numpy arrays up front to avoid repeated conversion ()
- Fix errors with DataFrame truth checks in hyperopt (#956)
- Added truncation to HF tokenizer (#978)
- Reimplemented Jaccard Metric for the Set Feature (#979)
- Fix learning rate computation with decay and warmup (#982)
- Fix CLI logger typos (#998, #999)
- Fix loading of split from hdf5 (#1003)
- Fix visualization unit tests (#981)
- Fix concatenate_csv to work with arbitrary read functions and renamed it to concatenate_datasets
- Fix compatibility issue with matplotlib 3.3.3
- Limit numpy and h5py max versions due to tensorflow 2.3.1 max supported versions (#990)
- Fixed usage of model_load_path with Horovod (#1011)
v0.3: TensorFlow 2, Hyperparameter optimization, Hugging Face Transformers integration, new data formats and more
Improvements
- Full porting to TensorFlow 2.
- New hyperparameter optimization functionality through the hyperopt command.
- Integration with HuggingFace Transformers for pre-trained text encoders.
- Refactored preprocessing with new supported data formats: auto, csv, df, dict, excel, feather, fwf, hdf5 (cache file produced during previous training), html (file containing a single HTML <table>), json, jsonl, parquet, pickle (pickled Pandas DataFrame), sas, spss, stata, tsv.
- Improved validation logic.
- New Transformer encoders for sequential data types (sequence, text, audio, timeseries).
- New batch_predict functionality in the REST API.
- New export command to export to SavedModel and Neuropod.
- New collect_summary command to print out a model summary with layer names.
- Modified the predict command and split it into predict and evaluate. The first only produces predictions, the second evaluates those predictions against ground truth.
- Two new hyperopt-related visualizations: hyperopt_report and hyperopt_hiplot.
- Improved tracking of metrics in TensorBoard.
- Greatly improved test suite.
- Various documentation improvements.
Bugfixes
This release includes a fundamental rewrite of the internals, so many bugs have been fixed while rewriting.
This list includes only the ones that have a specific issue associated with them, but many others were addressed.
- Fix #649: Replaced SPLIT with 'split' in example code.
- Fix documentation, wrong parameter name (#684)
- Fix #702: Fixed setting defaults in binary output feature.
- Fix #729: Reduce output was not passed to the sequence encoder inside the sequence combiner.
- Fix #742: Renamed self._learning_rate in Progresstracker.
- Fix #799: Added tf_version to description.json.
- Fix #840: Better messaging for plateau logic.
- Fix #850: Switch from ValueError to Warning to make stratify work on non-output features.
- Fix #844: Load LudwigModel in test_savedmodel before creating saved model.
- Fix #833: loads the model after training and before predicting if the model was saved on disk.
- Fix #933: Added NumpyDecoder before returning JSON response from server.
- Fix #935: Multiple categorical features with different vocabs now work.
Breaking changes
Because of the change in the underlying tensor computation library (TensorFlow 1 to TensorFlow 2) and the internal reworking it required, models trained with v0.2 don't work on v0.3.
We suggest retraining such models. In most cases the same model definition can be used, although one impactful breaking change is that model_definition is now called config, because it contains not only information about the model but also training, preprocessing, and a newly added hyperopt section.
There have been some changes in the parameters inside the config too.
In particular, one main change is that dropout is now a float value that can be specified for each encoder / combiner / decoder / layer, while before it was a boolean parameter.
As a consequence, the dropout_rate parameter in the training section has been removed.
Another change in the training parameters is the set of available optimizers.
TensorFlow 2 doesn't ship with some of the ones that were exposed in Ludwig (adagradda, proximalgd, proximaladagrad), and the momentum optimizer has been removed because momentum is now a parameter of the sgd optimizer.
Newly added optimizers are nadam and adamax.
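As a rough illustration of the dropout and optimizer changes described above, the sketch below rewrites a v0.2-style definition into a v0.3-style config. The `migrate_definition` helper is hypothetical (these notes do not mention any migration tool), the dictionary layout is simplified, and the shape of the old `momentum` optimizer entry is an assumption; check the User Guide for the exact parameter names.

```python
import copy

# Optimizers that, according to the notes above, are no longer available.
REMOVED_OPTIMIZERS = {"adagradda", "proximalgd", "proximaladagrad"}


def migrate_definition(model_definition):
    """Hypothetical helper: turn a v0.2-style definition into a v0.3-style config."""
    config = copy.deepcopy(model_definition)
    training = config.get("training", {})

    # v0.2: boolean `dropout` per component plus one global `dropout_rate`.
    # v0.3: float `dropout` per component, no `dropout_rate` in training.
    rate = training.pop("dropout_rate", 0.0)
    components = config.get("input_features", []) + config.get("output_features", [])
    if "combiner" in config:
        components.append(config["combiner"])
    for component in components:
        if isinstance(component.get("dropout"), bool):
            component["dropout"] = rate if component["dropout"] else 0.0

    # `momentum` is no longer its own optimizer but a parameter of `sgd`.
    optimizer = training.get("optimizer", {})
    if optimizer.get("type") == "momentum":
        optimizer["type"] = "sgd"
    elif optimizer.get("type") in REMOVED_OPTIMIZERS:
        raise ValueError(f"optimizer {optimizer['type']!r} is not available in v0.3")
    return config


# Assumed v0.2-style layout, for illustration only.
old = {
    "input_features": [{"name": "text", "type": "text", "dropout": True}],
    "output_features": [{"name": "label", "type": "category", "dropout": False}],
    "training": {"dropout_rate": 0.2,
                 "optimizer": {"type": "momentum", "momentum": 0.9}},
}
new = migrate_definition(old)
print(new["input_features"][0]["dropout"])
print(new["training"])
```

The helper works on a deep copy, so the original definition stays untouched and can be compared with the migrated one.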
Note that the accuracy metric for the combined feature has been removed because it was misleading in some scenarios when multiple features of different types were trained.
In most cases, encoders, combiners and decoders now have an increased number of exposed parameters to play with for increased flexibility.
One notable change is that the previous BERT encoder has been replaced by a HuggingFace-based one with different parameters, and it is now available only for text features.
Please refer to the User Guide for details on each encoder.
Tokenizers also changed substantially, with new parameters supported; refer to the User Guide for more details.
Other major changes are related to the CLI interface.
The predict command has been replaced in functionality with a simplified predict and a new evaluate. The first only produces predictions, the second evaluates those predictions against ground truth.
Some parameters of all CLI commands changed.
All the different types of data_* parameters have been replaced by the generic dataset, training_set, validation_set and test_set parameters. The data format is determined automatically, but it can also be set manually with the data_format argument. There is no gpu_fraction any more, but users can now specify gpu_limit to manage VRAM usage.
For all additional minor changes to the CLI please refer to the User Guide.
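The parameter renaming can be sketched as a small translation step. The `translate_arguments` helper below is hypothetical, and the old-style names it recognizes (for example data_csv or data_train_hdf5) are an assumption about what the data_* family looked like, since the exact list is not given here. Because gpu_fraction has no direct equivalent, the sketch refuses it rather than guessing a gpu_limit value.

```python
import re

# Assumed v0.2-style names such as data_csv or data_train_hdf5; the exact
# set of old parameters is not listed in these notes.
_OLD_DATA_ARG = re.compile(r"^data_(?:(train|validation|test)_)?(\w+)$")
_NEW_NAME = {None: "dataset", "train": "training_set",
             "validation": "validation_set", "test": "test_set"}


def translate_arguments(old_kwargs):
    """Hypothetical helper: map v0.2-style data_* arguments to v0.3 names."""
    new_kwargs = {}
    for name, value in old_kwargs.items():
        if name == "gpu_fraction":
            # No direct equivalent: gpu_limit must be chosen by the user.
            raise ValueError("gpu_fraction was removed; set gpu_limit instead")
        match = _OLD_DATA_ARG.match(name)
        if match and name != "data_format":
            split, data_format = match.groups()
            new_kwargs[_NEW_NAME[split]] = value
            # The format is detected automatically in v0.3, but it can
            # also be stated explicitly through data_format.
            new_kwargs["data_format"] = data_format
        else:
            new_kwargs[name] = value
    return new_kwargs


translated = translate_arguments(
    {"data_train_csv": "train.csv", "data_test_csv": "test.csv"}
)
print(translated)
```

Setting data_format explicitly is optional in this sketch; it is kept only to show where the information that used to live in the parameter name ends up.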
The programmatic API changed too, as a consequence.
Now all the parameters closely match the ones of the CLI, including the new dataset and gpu parameters.
Also in this case the predict function has been split into predict and evaluate.
Finally, the returned values of most functions changed to include some intermediate processing values, for instance the preprocessed and split data when calling train, the output experiment directory, and so on.
Notably, there is now an experiment function in the API too, together with a new hyperopt one.
For more details, refer to the API reference.
Contributors
@jimthompson5802 @tgaddair @kaushikb11 @ANarayan @calio @dme65 @ydudin3 @carlogrisetti @ifokeev @flozi00 @soovam123 @KushalP1 @JiByungKyu @stremlau @adiov @martinremy @dsblank @jakobt @vkuzmin-uber @mbzhu1 @moritzebeling @lnxpy
v0.2.2: WandB, K-Fold cross validation, better tracking of measures, and many bugfixes.
Improvements
Added integration with Weights and Biases.
Added K-Fold cross validation.
Added 4 examples with their respective code and Jupyter Notebooks: Hyper-parameter optimization, K-Fold Cross Validation, MNIST, Titanic.
Greatly improved the measures tracked on the TensorBoard.
Added auto-detect function for field separator when reading CSVs.
Added CI tooling.
Class weights can be specified as a dictionary #615.
Removed deprecation warning from h5py.
Removed most deprecation warnings from TensorFlow.
Bypass multiprocessing.Pool.map for faster execution.
Updated TensorFlow dependency to 1.15.2.
Various documentation improvements.
Bugfixes
Fix cudnn error on RTX GPUs.
Fix inverted confusion_matrix axis.
Fix #201: Removed whitespace as a separator option.
Fix #540: Fixed default text parameters for sampled loss.
Fix #541: Docker image improvements (removed libgmp and spacy model download).
Fix #554: Fix audio input test case in docker container.
Fix #570: Temporary solution for in_memory flag usage in API.
Fix #574: Setting intra and inter op parallelism to 0 so that TF determines them automatically.
Fix #329 and #575: Fixed use of SavedModel and added an integration test.
Fix #609: When predicting, if a split is in the CSV, data is split correctly.
Fix #616: Change preprocessing in siamese network example.
Fix #620: Failure in unit tests for 1 vs all calibration plots.
Fix #632: Setting minimum version requirements for six.
Fix #636: CLI output table column ordering preserved when resuming.
Fix #641: Added multi-task learning section specifying the weight for each output feature in the User Guide.
Fix #642: Fixing horovod use when loading a model as initialization.
Contributors
@jimthompson5802 @calz1 @pingsutw @vanpelt @carlogrisetti @anttisaukko @dsblank @borisdayma @flozi00 @jshah02
v0.2.1: Vector features, Norwegian and Lithuanian tokenizers, many bugfixes.
Improvements
Add Filter Bank features for audio.
Added two more parameters skip_save_test_predictions and skip_save_test_statistics to train and experiment CLI commands and API.
Updated to spaCy 2.2 with support for Norwegian and Lithuanian tokenizers.
Reorganized dependencies, now the defaults are bare-bones and there are several extra ones.
Added fc_layers to H3 embed encoder.
Added get_preprocessing_params in preprocessing.
Refactored image features preprocessing to use multiprocessing.
Refactored preprocessing with strategy pattern.
Bugfixes
Fix #452: Removed dependency on gmpy.
Fix #465: Adds capability to set the vocabulary from a Glove file.
Fix #480: Adds a health check to ludwig serve.
Fix #481: Added some examples of visualization commands.
Fix #491: Improved skip parameters, now no directories are created if not needed.
Fix #492: Adds skip saving unprocessed output api.py.
Fix #493: Added parameters for the vocabulary file and the UNK and PAD symbols in sequence feature call to create_vocabulary in the calculation of metadata.
Fix #500: Fixed learning_curves() when the training statistics file does not contain validation.
Fix #509: Fixes in_memory issues in image features.
Fix #525: Adding check is_on_master() before creating save_path directory.
Fix #510: Fixed version of pydantic.
Fix #532: Improved speed of add_sequence_feature_column().
Potentially breaking changes
Fix #520: Renamed field parameter in visualization to output_feature_name for clarity and improved documentation. Please make sure to rename your function calls if you were using this parameter by name (the order stays the same).
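The effect of the rename on existing calls can be shown with a toy function. `plot_curves` below is a hypothetical stand-in, not one of Ludwig's visualization functions; it only demonstrates why positional calls are unaffected while calls that pass the old keyword name need updating.

```python
def plot_curves(stats, output_feature_name):
    """Hypothetical visualization function using the new parameter name."""
    return f"plotting {output_feature_name} from {len(stats)} runs"


stats = [{"loss": 0.5}, {"loss": 0.4}]

# Positional calls keep working because the argument order is unchanged.
positional = plot_curves(stats, "label")

# Keyword calls must switch from field=... to output_feature_name=...
keyword = plot_curves(stats, output_feature_name="label")

# A call that still uses the old keyword name fails with a TypeError.
try:
    plot_curves(stats, field="label")
except TypeError as error:
    old_style_error = str(error)
    print(old_style_error)
```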
Contributors
@sriki18 @carlogrisetti @areeves87 @naresh-bhandari @revolunet @patrickvonplaten @Athanaziz @dsblank @tgaddair @Mechachleopteryx @AlexeyGy @yu-iskw
v0.2: BERT, Audio / Speech, geospatial and temporal features, Visualization API, Server and improved Comet.ml integration
Improvements
New BERT encoder with its BPE tokenizer
Added Audio features that can be used also for speech data (with appropriate preprocessing feature extraction)
Added H3 feature, together with 3 encoders to deal with spatial information
Added Date feature and two encoders to deal with temporal information
Improved Comet.ml integration
Refactored visualization.py to make individual functions usable from API
Added capability of saving visualization graph in the visualization command and visualizations_utils.py
Added a serve command that allows for spawning a prediction server using FastAPI
Added a test command (that requires output columns in the data) to avoid confusion with predict (which does not require output columns)
Added pixel normalization and pixel standardization scaling options for image features
Added greyscaling of images if specified channels = 1 and img channels is 3 or 4
Added normalization strategies for numerical features (#367)
Added experiment name parameter in the API (#357)
Refactored text tokenizers
Several improvements in logging
Added a method for saving models with SavedModels in model.py and exposed it in the API with a save_for_serving() function (#329)(#425)
Upgraded to the latest version of TensorFlow 1.14 (#429)
Added learning rate warmup for non distributed settings
Bugfixes
Fix #321: Removed the 6n+2 check for ResNet size
Fix #328: adds missing UPDATE_OPS to the optimization operation
Fix #336: GloVe embeddings loading now reads utf-8 encoded files
Fix #336: Addresses the malformed lines issue in embeddings loading
Fix #346: added a parameter indicating if the session should be closed after training in full_train
Fix #351: values in categorical columns are now stripped before being compared to the vocabulary
Fix #364: associate the right function to non english text format functions
Fix #372: set evaluate performance parameter to false in predict.py
Fix #394: Improved error explanation when image dimensions don't match and improved documentation accordingly
Fix #411: Images in HDF5 are now correctly saved as uint8 instead of int8
Fix #431: missing libgmp3-dev dependency in docker (#428)
Fixed image resizing
Fix model load path (#424)
Fix batch norm in convolutional layers (now uses tf internal layer and not the one in contrib)
Several additional minor fixes
Contributors
@carlogrisetti @jaipradeesh @glongh @dsblank @danicattaneob @gogasca @lordeddard @IgorWilbert @patrickvonplaten @ojus1 @jimthompson5802 @johnwahba @revolunet @gogasca
v0.1.2: Import speed improvements, safety-related fixes and various minor fixes and improvements
Improvements
- Improved import speed by ~50%
- Improved Comet.ml integration
- Replaced only_predict with evaluate_performance (and flipped the logic) in all predict commands and functions
- Refactored preprocessing functions for improved testability, understandability and extensibility
- Added data_dict to the train method in LudwigModel
- Improved tests speed
Bugfixes
- Fix issue #283: word_format in text features is now properly used
- Fix issue #286: avoid using signal when not on main thread
- Fix issue where the order of operations when preprocessing images between resizing and changing channels was inverted
- Fix safety issues: now using yaml.safe_load instead of yaml.load and replaced pickling of the progress tracker with a JSON equivalent
- Fix minor bug with missing tied_weights key in some features
- Fixed a few minor issues discovered with deepsource.io
Other Changes
- LudwigModel was previously imported from ludwig; it should now be imported from ludwig.api. This change was needed to speed up imports
Contributors
v0.1.1: Bug fixes, new parameters and Comet.ml integration
New features and improvements
- Updated to tensorflow 1.13.1 and spacy 2.1 (this also makes Ludwig compatible with Python 3.7)
- Added an initial integration with Comet.ml
- Added support for text preprocessing of additional languages: Italian, Spanish, German, French, Portuguese, Dutch, Greek and Multi-language (Feature Request #251).
- Added skip_save_progress, skip_save_model and skip_save_log parameters
- Improved the default parameters of the image feature (this may make previously trained models including image features not compatible. If that is the case retrain your model)
- Added PassthroughEncoder
- Added eval_batch_size parameter
- Added sanity checks for model definitions, with improved error messages
- Add Dockerfile for running Ludwig on a CPU
- Added clip parameter to numerical output features
- Added a full MNIST training example, a fraud detection example and a more complex regression example on fuel consumption
Bug fixes
- Fix issue #56: removing just keys that exist in dataset when replacing text feature names concatenating their level
- Fix issue #46 #144: Solved Mac OS X mpl.use('TkAgg') use
- Fix issue #74: Call subprocess within try except
- Fix issue #81: Opens a file before calling yaml.load()
- Fix issue #90: Forcing csv writer to write utf-8 encoded files
- Fix issue #120: Missing sgd (and synonyms) key in optimizers default
- Fix issue #64: Fix for files with capitalized extensions
- Fix issue #121: Typo bucketin_field to bucketing_field
- Fix training when validation or test CSVs are provided separately
- Fix issue #112: dataframe_df may not have a csv attribute
- Fix missing checks if dataset is None in preprocessing.py and api.py
- Fix error measure aggregation and default value
- Fix image interpolation
- Fix preprocessing_defaults error in bag_feature.py
- Fix text output features populate_defaults() and update_model_definition_with_metadata()
- Fix in timeseries placeholder datatype
- Moved image preprocessing params to preprocessing section (this may make previously trained models including image features not compatible. If that is the case retrain your model)
- Fix warmup learning rate function for distributed training
- Fix issue #214: replace_text_feature_level usage in api.py
- Fix issue #214: replaced SPACE_PUNCTUATION_REGEX
- Fix issue #229 #100: solved missing hdf5 / csv file reference
- Fix issue #222: incorrect logging in read_csv
- Fix issue #194: Renaming class_distance to class_similarities and several bugfixes regarding class_similarities, class_weights and their interaction at model building time
- Fix issue #100 #225: solves image prediction issues
- Fix issue #98: solves dealing with images with different numbers of channels, including transparencies
- Fix unwanted creation of hdf5 files when running ludwig.predict on images
- And a few more minor fixes
Contributors
Thanks to all our amazing contributors (some of your PRs were not merged, but we used some of their code in our commits, so thank you anyway!):
@dsblank @MariusDanner @BenMacKenzie @Barathwaja @gabefair @kevinqz @yantsey @jontonsoup4 @Praneet460 @DakshMiglani @syeef @tejaf @rolisz @JakeConnors376W @Andyzzh @us @0xflotus @laserbeam3 @krychu @dettmering @bbrodsky @c-m-hunt @C0deFxxker @hemchander23 @Shivam-Beeyani @yashrajbharti @rbramwell @emushtaq @EBazarov @graytowne @jovilius @ivanhe @philippgille @floscha