
Simplifying data processing #9

Open
mjpost opened this issue Dec 11, 2018 · 8 comments

Comments

@mjpost
Collaborator

mjpost commented Dec 11, 2018

I would like to propose that we remove the dummy* tasks that allow data to be merged. They add a lot of complexity to the task structure as well as to the test-time variables.

Instead, I would propose that data processing dependencies be changed from files (<) to variables (::), and then that we allow multiple values in each variable. If there are multiple values, the staging step will concatenate them.

This retains the ability to have multiple training or dev datasets, but removes the complexity from ducttape and puts it all into the preprocessing stage.
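A minimal sketch of what I have in mind (the task and variable names here are hypothetical, not from the actual workflow):

global {
  # hypothetical: several corpora listed in a single variable
  train_prefixes="europarl news-commentary"
}

# the dependency is now a variable (::) rather than a file (<);
# the staging step concatenates all listed values
task stage_train_data
    :: train_prefixes=@
    > train
{
  for prefix in $train_prefixes; do
    cat raw/$prefix >> $train
  done
}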

Thoughts?

@shuoyangd
Owner

Can you elaborate on how you would model dependencies with variables? From my perspective, if you use variables rather than files, all the tasks without an input file will run at the same time, without waiting for their predecessors to finish.

@mjpost
Collaborator Author

mjpost commented Dec 14, 2018

Okay, see the PR I just made. You can consider the PR as work in progress. The training data is now listed in a single variable as a space-delimited list of file prefixes. The merging is done in download_or_link. Lots of the code and variables are simplified this way. XML is currently not supported but (a) I think we should get rid of it anyway and (b) it would be simple to add it in download_or_link (though not to use it again for scoring).
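The shape of the change is roughly this (a sketch only; the exact variable name and the assumption that each prefix expands to .src/.trg files are mine):

task download_or_link
    :: train_prefixes=@
    > train_src train_trg
{
  # concatenate every listed prefix into single training files
  for prefix in $train_prefixes; do
    cat $prefix.src >> $train_src
    cat $prefix.trg >> $train_trg
  done
}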

@shuoyangd
Owner

shuoyangd commented Dec 14, 2018

I think you are mixing up two different things here: the dummy* tasks are not caused by merging -- they are due to a bug in DuctTape that prevents us from doing branch grafting on variables (we can graft on task outputs, though). I don't know if you have tested your changes yet, but stuff like this looks like it will crash the task.

So basically the dummy* tasks are a temporary measure: they wrap variables like tokenized_data and make them look like task outputs. The right way to get rid of these tasks is to fix this bug in DuctTape.
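For context, such a dummy task is just an identity step (sketch; names are hypothetical):

# wraps the variable $tokenized_data as a task output so that
# downstream tasks can branch-graft on it
task dummy_tokenized
    < in=$tokenized_data
    > out
{
  ln -s $in $out
}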

Now, on to the issue of the merge task. Your solution for getting rid of the merge task is fine, but given that it has nothing to do with the dummy* tasks, I don't think it's really necessary to get rid of merge -- especially if you need to make up a carefully-formatted variable to denote a list of data files (and there is always a danger of messing up the order). Enlighten me if I missed some other motivation for this.

@mjpost
Collaborator Author

mjpost commented Dec 14, 2018

Regarding dummy tasks: the line you point to doesn't crash, though you're right that it doesn't work. However, you can define variables with the graft, like this, which seems clearer to me than using dummy tasks. You can see I've done this for the sacrebleu task. So there is a workaround that doesn't require grafting on variables.
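The workaround is to apply the graft at the point where the variable is defined (sketch; the branch and task names are hypothetical):

global {
  # graft applied in the definition itself, instead of
  # routing through a dummy task
  test_data=$out@tokenize[DataSection:test]
}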

@mjpost
Collaborator Author

mjpost commented Dec 14, 2018

Also, I think branch grafts are allowed on variables: jhclark/ducttape#30. I just tested the code you pointed to, and it works fine.

@shuoyangd
Owner

You can't graft on variables like this though.

Here is a minimal example that fails.

@mjpost
Collaborator Author

mjpost commented Jan 11, 2019

This is interesting, because I do graft on variables sometimes. See this, for example, which works fine:

task prepare_data : sockeye
    < train_src_in=$prepared_data[DataSection:train,side:src]
    < train_trg_in=$prepared_data[DataSection:train,side:trg]
    < factor_files=(SourceFactors: no="/dev/null" yes=$factor_files@compute_source_factors[DataSection:train])
    > data
    ...

$prepared_data is defined here in the same way:

global {
  prepared_data=(SubwordMethod:
    sentencepiece=$out@apply_sentencepiece
    bpe=$out@apply_bpe
    none=$recased_data
  )
}

@mjpost
Collaborator Author

mjpost commented Jan 11, 2019

As we discussed, it seems that it breaks when a variable is defined with nested branches.
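So something like the following shape is what fails when grafted on (a hypothetical sketch; note that the working $prepared_data definition above has only a single level of branching):

global {
  # a variable defined with nested branch points;
  # grafting on $data[Side:src] breaks here
  data=(Side:
    src=(Casing:
      cased=$out@truecase
      lower=$out@lowercase)
    trg=$out@extract_trg
  )
}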
