---
layout: default
title: Executing processes in the data flow
---
After a DeepDive application is compiled, each process in the data flow it defines can be executed with great flexibility. DeepDive provides a core set of commands that supports precise control of the execution. For example, the following tasks are all easy to express with DeepDive's commands:
- Executing the complete data flow.
- Executing a fragment of the complete data flow.
- Stopping execution at any point and resuming from where it stopped.
- Repeating certain processes with different parameters.
- Skipping expensive processes by loading their output from external data sources.
To simply run the complete data flow in an application from beginning to end, use the following command under any subdirectory of the application:

```bash
deepdive run
```
This command will:
- Initialize the application as well as its configured database.
- Run all processes that correspond to the normal derivation rules as well as the rules with user-defined functions.
- Ground the factor graph according to the inference rules defining the model for statistical inference.
- Perform weight learning and inference to compute the marginal probability of every variable.
- Generate calibration plots and data for debugging.
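As a minimal sketch of this simplest workflow (the directory name below is hypothetical), an end-to-end run looks like:

```bash
cd my_app/udf    # any subdirectory of the application works
deepdive run     # executes the complete data flow from scratch
```

Outputs such as the calibration plots land under the application's `run/` directory, whose contents are described later in this page.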
There are several execution scenarios that frequently arise while developing a DeepDive application.
Any of the commands shown in this page can be run under any subdirectory of a DeepDive application.
To see all options for each command, view the online help message with the `deepdive help` command.
For example, the following shows the detailed usage of the `deepdive do` command:

```bash
deepdive help do
```
Running only a small part of the data flow defined in an application is the most common scenario. The following command stops the execution at the given TARGETs instead of continuing to the end:

```bash
deepdive do TARGET...
```

It presents a shell script in a text editor that enumerates the processes to be run for the given TARGETs. Saving the final plan and quitting the editor starts the actual execution.
Valid TARGET names are shown when no argument is given:

```bash
deepdive do
```
Refer to the next section for more detail about these TARGET names.
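For instance, using node names from the tutorial example described later in this page, a single invocation can target several nodes at once:

```bash
# Run everything needed for the sentences relation and the
# calibration plots, but nothing else.
deepdive do data/sentences model/calibration-plots
```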
A started execution can be interrupted at any time (with Ctrl-C, i.e., ^C) or aborted by an error, and then resumed later:

```bash
deepdive do TARGET...
...
^C
...
deepdive do TARGET...
```
DeepDive will resume the execution from the last unfinished process, skipping what has already been done.
Repeating certain parts of the data flow is another common scenario.
DeepDive provides a way to mark certain processes as not done so they can be repeated.
For example, the sequence of two commands below is basically what `deepdive run` does, i.e., it repeats the end-to-end execution:

```bash
deepdive mark todo init/app calibration weights
deepdive do init/app calibration weights
```

DeepDive provides a shorthand, `deepdive redo`, for easier repetition:

```bash
deepdive redo init/app calibration weights
```
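The same shorthand works for a single process, e.g., using the extractor name that appears in the tutorial plan shown later on this page:

```bash
# Re-run only the NLP markup extractor and everything that
# depends on its output.
deepdive redo process/ext_sentences_by_nlp_markup
```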
Skipping certain parts of the data flow and starting from manually loaded data is also easy.
This can be useful for skipping processes that take very long.
For example, suppose relation `bar_derived_from_foo` is derived from relation `foo`, and `foo` takes an excessive amount of time to compute and has therefore been saved at `/some/data/source.tsv`.
Then the following sequence of commands skips all processes that `foo` depends on, marks it as new, and executes the processes that derive `bar_derived_from_foo` from `foo`:
```bash
deepdive create table foo
deepdive load foo /some/data/source.tsv
deepdive mark new foo
deepdive do bar_derived_from_foo
```
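For completeness, such a snapshot could have been produced from an earlier run with `deepdive sql eval`, which prints query results to standard output; the query and output path here are only illustrative:

```bash
# Export the expensive relation once so later runs can load it
# instead of recomputing it.
deepdive sql eval 'SELECT * FROM foo' format=tsv > /some/data/source.tsv
```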
For execution, DeepDive produces a compiled data flow graph that consists primarily of data, model, and process nodes, with edges representing dependencies between them.
- Data nodes correspond to relations or tables in the database.
- Model nodes denote artifacts for statistical learning and inference.
- Process nodes represent a unit of computation.
- An edge between a process and a data/model node means the process takes that node as input or produces it.
- An edge between two processes denotes that one depends on the other, hiding the details of the intermediate nodes.
DeepDive compiles a few built-in processes into the data flow graph that are necessary for initialization, statistical learning and inference, and calibration.
The rest of the processes correspond to the rules for deriving relations according to the DDlog program and the extractors in `deepdive.conf`.
Below is a data flow graph compiled for the tutorial example; for any application, the compiled graph can be found at `run/dataflow.svg`.
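Since `run/dataflow.svg` is a plain SVG file, any browser or image viewer can display it; for example (`open` is the macOS opener, `xdg-open` a common Linux counterpart):

```bash
open run/dataflow.svg      # macOS; use xdg-open on Linux
```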
DeepDive uses the compiled data flow graph to find the correct order of processes to execute. Any node or set of nodes in the data flow graph can be set as the target for execution, and DeepDive enumerates all necessary processes as an execution plan.
This execution plan can be seen using the `deepdive plan` command.
For example, a plan for `data/sentences` can be requested with the following command:

```bash
deepdive plan data/sentences
```
DeepDive gives an output that looks like:
```bash
# execution plan for sentences
: ## process/init/app ##########################################################
: # Done: 2016-02-01T20:56:50-0800 (2d 23h 55m 9s ago)
: process/init/app/run.sh
: mark_done process/init/app
: ##############################################################################
:
: ## process/init/relation/articles ############################################
: # Done: 2016-02-01T20:56:55-0800 (2d 23h 55m 4s ago)
: process/init/relation/articles/run.sh
: mark_done process/init/relation/articles
: ##############################################################################
:
: ## data/articles #############################################################
: # Done: 2016-02-01T20:56:55-0800 (2d 23h 55m 4s ago)
: # no-op
: mark_done data/articles
: ##############################################################################
## process/ext_sentences_by_nlp_markup #######################################
# Done: N/A
process/ext_sentences_by_nlp_markup/run.sh
mark_done process/ext_sentences_by_nlp_markup
##############################################################################
## data/sentences ############################################################
# Done: N/A
# no-op
mark_done data/sentences
##############################################################################
```
An execution plan is basically a shell script that invokes the actual `run.sh` compiled for each process.
Processes that have already been marked as done are commented out (with `:`), and exactly when they were done is displayed in the comments.
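Because the plan is printed as a shell script on standard output, it can also be captured to a file for review; the file name below is arbitrary:

```bash
deepdive plan data/sentences > sentences-plan.sh
```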
DeepDive provides a `deepdive do` command that takes as input the target nodes in the data flow graph, presents the execution plan in an editor, and then executes the final plan.
Because of this chance to modify the generated execution plan, the user has complete control over what is executed and can override or skip certain processes if needed:

```bash
deepdive do data/sentences
```
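When no editing is needed, the editor step can be skipped by setting the `DEEPDIVE_PLAN_EDIT` environment variable documented at the end of this page:

```bash
# Execute the generated plan as-is, without opening an editor.
DEEPDIVE_PLAN_EDIT=false deepdive do data/sentences
```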
After a process finishes its execution, the execution plan includes a `mark_done` command to mark the executed process as done.
The command touches a `process/name.done` file under the `run/` directory to record a timestamp.
These timestamps are then used to determine which processes have finished and which have not.
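These `.done` files are ordinary files, so their timestamps can be inspected directly; the exact path below assumes the layout implied above:

```bash
# Show when application initialization was last marked done.
ls -l run/process/init/app.done
```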
DeepDive provides a `deepdive mark` command to manipulate such timestamps to repeat or skip certain parts of the data flow.
It allows a given node to be marked as:
- `done`, so all processes depending on the node, including itself, can be skipped.
- `todo-from-scratch`, so all processes from the first process in the execution plan can be repeated.
- `todo`, so all processes that depend on the node, including itself, can be repeated.
- `new`, so all processes that depend on the node can be repeated (not including itself).
- `all-new`, so all processes that depend on the node or any of its ancestors can be repeated (not including themselves).
For example, if we mark a process to be repeated using the following command:

```bash
deepdive mark todo process/init/relation/articles
```

then `deepdive plan` will give an output like the one below, repeating processes that were already marked as done in the past:
```bash
# execution plan for sentences
: ## process/init/app ##########################################################
: # Done: 2016-02-01T20:56:50-0800 (2d 23h 55m 39s ago)
: process/init/app/run.sh
: mark_done process/init/app
: ##############################################################################
## process/init/relation/articles ############################################
# Done: 2016-02-01T20:56:55-0800 (2d 23h 55m 34s ago)
process/init/relation/articles/run.sh
mark_done process/init/relation/articles
##############################################################################
## data/articles #############################################################
# Done: 2016-02-01T20:56:55-0800 (2d 23h 55m 34s ago)
# no-op
mark_done data/articles
##############################################################################
## process/ext_sentences_by_nlp_markup #######################################
# Done: N/A
process/ext_sentences_by_nlp_markup/run.sh
mark_done process/ext_sentences_by_nlp_markup
##############################################################################
## data/sentences ############################################################
# Done: N/A
# no-op
mark_done data/sentences
##############################################################################
```
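Putting the pieces together, a typical repeat cycle is mark, inspect, run, using target names from the tutorial example:

```bash
deepdive mark todo process/init/relation/articles   # flag the process as not done
deepdive plan data/sentences                        # inspect what would be re-run
deepdive do data/sentences                          # execute the plan
```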
DeepDive adds several built-in processes to the compiled data flow to ensure necessary steps are performed before and after the user-defined processes.
- `process/init/app`: Initializes the application's database and executes the `input/init.sh` script if available, which can be used to ensure input data is downloaded and that libraries and code required by UDFs are set up correctly. All processes that do not depend on any other process automatically become dependent on this one, ensuring every process is executed after the application is initialized.
- `process/init/relation/R`: Creates the table R in the database and loads data from `input/R.*`. These processes are automatically created and added to the data flow for every relation that is neither derived from another relation nor output by any process.
- `process/grounding/*`: All processes for grounding the factor graph are put under this namespace.
- `process/model/*`: All processes for learning and inference are put under this namespace.
- `model/*`: All artifacts related to the statistical inference model that do not belong to the database are put under this namespace, e.g.:
    - `model/factorgraph` denotes factor graph binary files under `run/model/factorgraph/`.
    - `model/weights` denotes text files that contain the learned weights of the model under `run/model/weights/`.
    - `model/probabilities` denotes files holding the computed marginal probabilities under `run/model/probabilities/`.
    - `model/calibration-plots` denotes calibration plot images and data files under `run/model/calibration-plots/`.
- `data/model/*`: All data relevant to statistical inference that go into the database are put under this namespace. For example, `data/model/weights` and `data/model/probabilities` correspond to tables and views that keep the learning and inference results.
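Any of these built-in node names can be used as targets for the commands described earlier; for example, after changing inference rules, one plausible way to refresh only the model-related steps (assuming `model/probabilities` is a valid target in the compiled graph, as the namespace above suggests) is:

```bash
# Re-run grounding, learning, and inference without touching extraction.
deepdive redo model/probabilities
```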
There are several environment variables that can be tweaked to influence how processes are executed; a combined example follows the list.

- `DEEPDIVE_NUM_PROCESSES`: Controls the number of UDF processes to run in parallel. Defaults to one less than the number of processors, with a minimum of one.
- `DEEPDIVE_NUM_PARALLEL_UNLOADS` and `DEEPDIVE_NUM_PARALLEL_LOADS`: Control the number of processes to run in parallel for unloading from and loading into the database. These default to one.
- `DEEPDIVE_PLAN_EDIT`: Controls whether a chance to edit the execution plan is provided (when set to `true`) or not (when set to `false`).
- `VISUAL` and `EDITOR`: Decide which editor to use. Defaults to `vi`.
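A minimal sketch of combining these variables for an unattended run; the values are illustrative, not recommendations:

```bash
# Run with 8 parallel UDF processes, 2 parallel loaders, and no plan editor.
export DEEPDIVE_NUM_PROCESSES=8
export DEEPDIVE_NUM_PARALLEL_LOADS=2
export DEEPDIVE_PLAN_EDIT=false
deepdive run
```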