DSBox AutoML System

The Data Scientist in a Box (DSBox) system is an AutoML system that automates the generation of machine learning pipelines, including data augmentation, data cleaning, data featurization, model selection, and hyperparameter tuning. DSBox uses a highly customizable pipeline template representation that lets advanced users incorporate their machine learning knowledge to configure the pipeline search space. For non-expert users, DSBox generates high-quality machine learning models with minimal intervention.

DSBox is part of the Data Driven Discovery of Models (D3M) program. More information about the program and about related projects can be found at the D3M website.

FINAL VERSION

This is the final release of the DSBox AutoML system. A Dockerized version is available at https://hub.docker.com/repository/docker/ckxz105/dsbox-ta2 (public). The image has all primitives installed and ready to run, except for some special primitives that require static files.

To run those special primitives (most are image- or audio-related), you need to download their static files. To do so, add an extra python3 -m d3m index download step to the Docker build file (available in the dockerBuildFiles folder of this repo).

Usage

To run only the dsbox-ta2 system, use:

docker run --entrypoint /user_opt/client.sh \
       --name $docker_name \
       -m 200G \
       --shm-size=50g \
       --cpus=20 \
       -e D3MRUN=ta2 \
       -e D3MINPUTDIR=/input \
       -e D3MOUTPUTDIR=/output \
       -e D3MLOCALDIR=/tmp \
       -e D3MSTATICDIR=/static \
       -e D3MPROBLEMPATH=/input/TRAIN/problem_TRAIN/problemDoc.json \
       -e CUDA_VISIBLE_DEVICES=$gpu_id \
       -e D3MCPU=$D3MCPU \
       -e D3MRAM=50 \
       -e D3MTIMEOUT=$D3MTIMEOUT \
       -e DSBOX_LOGGING_LEVEL="dsbox=DEBUG:console_logging_level=WARNING:file_logging_level=DEBUG" \
       -v ${dataset_dir}/${problem}:/input \
       -v ${output_dir}:/output \
       -v /data/1/dsbox/static_files/static/:/static \
       $docker_image

Users can adjust the CPU and memory limits to suit the problem at hand, and can also change the logging level. The important part is to set the dataset path ${dataset_dir}, the problem name ${problem}, and the output path ${output_dir} to the corresponding values.
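As a rough illustration of which values change between runs, the sketch below assembles a subset of the command above in Python. The helper name build_docker_command, its default container name, and the example paths and image tag are hypothetical and not part of DSBox; the flags and environment variables are copied from the command above, and the GPU, logging, timeout, and static-file options are left out for brevity.

```python
import shlex


def build_docker_command(dataset_dir, problem, output_dir, docker_image,
                         docker_name="dsbox-ta2-run", cpus=20, memory="200G"):
    """Assemble a docker run argument list (hypothetical helper, subset of flags)."""
    # Environment variables copied from the docker run command in this README.
    env = {
        "D3MRUN": "ta2",
        "D3MINPUTDIR": "/input",
        "D3MOUTPUTDIR": "/output",
        "D3MLOCALDIR": "/tmp",
        "D3MSTATICDIR": "/static",
        "D3MPROBLEMPATH": "/input/TRAIN/problem_TRAIN/problemDoc.json",
    }
    cmd = [
        "docker", "run",
        "--entrypoint", "/user_opt/client.sh",
        "--name", docker_name,
        "-m", memory,
        f"--cpus={cpus}",
    ]
    for key, value in env.items():
        cmd += ["-e", f"{key}={value}"]
    # The three user-supplied locations: dataset path, problem name, output path.
    cmd += [
        "-v", f"{dataset_dir}/{problem}:/input",
        "-v", f"{output_dir}:/output",
        docker_image,
    ]
    return cmd


# Example values are placeholders; substitute your own paths and image tag.
example = build_docker_command("/data/datasets", "my_problem", "/data/output",
                               "ckxz105/dsbox-ta2")
print(shlex.join(example))
```

The printed string can be reviewed before it is run, or the list can be passed to subprocess.run directly.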

Input and Output format

The input dataset should follow the D3M dataset format. A sample dataset is available at dsbox-ta2/unit_tests/resources. For the detailed structure of a dataset, refer to https://gitlab.com/datadrivendiscovery/data-supply/-/tree/v4.0.0 (public repo). Existing datasets are at https://gitlab.datadrivendiscovery.org/d3m/datasets (D3M users only).
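Before starting a container, it can help to confirm that the problem description sits where the run command expects it. The following is a minimal sketch, assuming the host layout implied by the -v ${dataset_dir}/${problem}:/input mount and the D3MPROBLEMPATH value in the Usage section; the function names are hypothetical, and the check covers only problemDoc.json, not the full D3M dataset structure.

```python
import json
from pathlib import Path


def find_problem_doc(dataset_dir, problem):
    """Return the host path that the container sees as D3MPROBLEMPATH."""
    return Path(dataset_dir) / problem / "TRAIN" / "problem_TRAIN" / "problemDoc.json"


def check_problem_doc(dataset_dir, problem):
    """Return (ok, message) saying whether problemDoc.json exists and parses as JSON."""
    path = find_problem_doc(dataset_dir, problem)
    if not path.is_file():
        return False, f"missing: {path}"
    try:
        with path.open(encoding="utf-8") as handle:
            json.load(handle)
    except json.JSONDecodeError as error:
        return False, f"invalid JSON in {path}: {error}"
    return True, f"ok: {path}"
```

For example, check_problem_doc("/data/datasets", "my_problem") returns a pair whose first element is False when the file is absent or malformed.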

Output

The main output is a set of JSON pipeline files in output/pipelines_ranked. These pipeline files can then be run to make predictions. For details on using them, refer to https://gitlab.com/datadrivendiscovery/d3m/-/tree/v2020.1.9/docs

There is also a score folder that holds the prediction scores computed on the score dataset. If a score file is empty, the corresponding pipeline failed.
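A quick way to review a finished run is to list the ranked pipeline files and flag any empty score files. The sketch below assumes that "empty" means a file size of zero bytes; the function names are hypothetical, and because this README does not name the score folder, its path is passed in explicitly rather than guessed.

```python
from pathlib import Path


def list_ranked_pipelines(output_dir):
    """Return the JSON pipeline files in <output_dir>/pipelines_ranked, sorted by name."""
    return sorted((Path(output_dir) / "pipelines_ranked").glob("*.json"))


def find_empty_score_files(score_dir):
    """Return zero-byte files in score_dir, which indicate failed pipelines."""
    return sorted(
        path for path in Path(score_dir).iterdir()
        if path.is_file() and path.stat().st_size == 0
    )
```

Call list_ranked_pipelines with the host directory mounted as /output, and find_empty_score_files with the path of the score folder produced by your run.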

