
Simplifying data processing #9

Open
mjpost opened this issue Dec 11, 2018 · 8 comments

Comments

@mjpost
Collaborator

mjpost commented Dec 11, 2018

I would like to propose that we remove the dummy* tasks that allow data to be merged. They add a lot of complexity to the task structure as well as to the test-time variables.

Instead, I would propose that data processing dependencies be changed from files (<) to variables (::), and then that we allow multiple values in each variable. If there are multiple values, the staging step will concatenate them.

This retains the ability to have multiple training or dev datasets, but removes the complexity from ducttape and puts it all into the preprocessing stage.
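A minimal sketch of what I have in mind (the task and variable names here are hypothetical, not from the actual workflow):

global {
  # hypothetical: several corpora listed in a single variable
  train_prefixes="europarl news-commentary"
}

# the dependency is now a variable (::) rather than a file (<);
# the staging step concatenates all listed values
task stage_train_data
    :: train_prefixes=@
    > train
{
  for prefix in $train_prefixes; do
    cat raw/$prefix >> $train
  done
}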

Thoughts?

@shuoyangd
Owner

Can you elaborate on how you would model dependencies with variables? From my perspective, if you use variables rather than files, all the tasks without an input file will run at the same time, without waiting for their predecessors to finish.

@mjpost
Collaborator Author

mjpost commented Dec 14, 2018

Okay, see the PR I just made. You can consider the PR as work in progress. The training data is now listed in a single variable as a space-delimited list of file prefixes. The merging is done in download_or_link. Lots of the code and variables are simplified this way. XML is currently not supported but (a) I think we should get rid of it anyway and (b) it would be simple to add it in download_or_link (though not to use it again for scoring).
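The shape of the change is roughly this (a sketch only; the exact variable name and the assumption that each prefix expands to .src/.trg files are mine):

task download_or_link
    :: train_prefixes=@
    > train_src train_trg
{
  # concatenate every listed prefix into single training files
  for prefix in $train_prefixes; do
    cat $prefix.src >> $train_src
    cat $prefix.trg >> $train_trg
  done
}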

@shuoyangd
Owner

shuoyangd commented Dec 14, 2018

I think you are mixing up two different things here: the dummy* tasks are not caused by merging -- they are due to a bug in DuctTape that prevents us from doing branch grafting on variables (we can graft on task outputs, though). I don't know if you have tested your changes yet, but stuff like this looks like it will crash the task.

So basically the dummy* tasks are a temporary measure: they wrap variables like tokenized_data and make them look like task outputs. The right way to get rid of these tasks is to fix this bug in DuctTape.
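For context, such a dummy task is just an identity step (sketch; names are hypothetical):

# wraps the variable $tokenized_data as a task output so that
# downstream tasks can branch-graft on it
task dummy_tokenized
    < in=$tokenized_data
    > out
{
  ln -s $in $out
}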

Now, on to the issue of the merge task. Your solution for getting rid of the merge task is fine, but given that it has nothing to do with the dummy* tasks, I don't think it's really necessary to get rid of merge -- especially if you need to make up a carefully-formatted variable to denote a list of data files (and there is always a danger of messing up the order). Enlighten me if I missed some other motivation for this.

@mjpost
Collaborator Author

mjpost commented Dec 14, 2018

Regarding dummy tasks: the line you point to doesn't crash, though you're right that it doesn't work. However, you can define variables with the graft, like this, which seems clearer to me than using dummy tasks. You can see I've done this for the sacrebleu task. So there is a workaround that doesn't require grafting on variables.
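The workaround is to apply the graft at the point where the variable is defined (sketch; the branch and task names are hypothetical):

global {
  # graft applied in the definition itself, instead of
  # routing through a dummy task
  test_data=$out@tokenize[DataSection:test]
}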

@mjpost
Collaborator Author

mjpost commented Dec 14, 2018

Also, I think branch grafts are allowed on variables: jhclark/ducttape#30. I just tested the code you pointed to, and it works fine.

@shuoyangd
Owner

You can't graft on variables like this though.

Here is a minimal example that fails.

@mjpost
Collaborator Author

mjpost commented Jan 11, 2019

This is interesting, because I do graft on variables sometimes. See this, for example, which works fine:

task prepare_data : sockeye
    < train_src_in=$prepared_data[DataSection:train,side:src]
    < train_trg_in=$prepared_data[DataSection:train,side:trg]
    < factor_files=(SourceFactors: no="/dev/null" yes=$factor_files@compute_source_factors[DataSection:train])
    > data
    ...

$prepared_data is defined here in the same way:

global {
  prepared_data=(SubwordMethod:
    sentencepiece=$out@apply_sentencepiece
    bpe=$out@apply_bpe
    none=$recased_data
  )
}

@mjpost
Collaborator Author

mjpost commented Jan 11, 2019

As we discussed, it seems that it breaks when a variable is defined with nested branches.
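So something like the following shape is what fails when grafted on (a hypothetical sketch; note that the working $prepared_data definition above has only a single level of branching):

global {
  # a variable defined with nested branch points;
  # grafting on $data[Side:src] breaks here
  data=(Side:
    src=(Casing:
      cased=$out@truecase
      lower=$out@lowercase)
    trg=$out@extract_trg
  )
}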
