Simplifying data processing #9
Comments
Can you elaborate on how you would model dependencies with variables? From my perspective, if you use variables rather than files, all the tasks without an input file will run at the same time without waiting for their predecessors to finish.
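To make the concern concrete, here is a minimal sketch in ducttape syntax (the task names, paths, and the `preprocess.sh` script are invented for illustration): a file input declared with `<` gives ducttape a dependency edge to wait on, whereas a `::` parameter is just a string value, so ducttape has no reason to wait for the task that actually produces that file.

```
# Hypothetical tasks, for illustration only.
task download > out {
  wget -O $out http://example.com/train.de-en.gz
}

# File input (<): ducttape sees an edge to 'download' and waits for it.
task preprocess_with_file < in=$out@download > out {
  zcat $in | preprocess.sh > $out
}

# Parameter (::): the path is just a string value, so there is no edge back
# to 'download' and ducttape could schedule this task before the file exists.
task preprocess_with_param > out :: path=/some/shared/dir/train.de-en.gz {
  zcat $path | preprocess.sh > $out
}
```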
Okay, see the PR I just made. You can consider the PR a work in progress. The training data is now listed in a single variable as a space-delimited list of file prefixes. The merging is done in download_or_link. Lots of the code and variables are simplified this way. XML is currently not supported, but (a) I think we should get rid of it anyway, and (b) it would be simple to add it in download_or_link (though not to use it again for scoring).
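For readers skimming the thread, a rough sketch of what this could look like, assuming ducttape syntax; the variable names (`train_sets`, `raw_dir`) and the file layout are invented here, not taken from the PR:

```
global {
  # Space-delimited list of corpus prefixes (names invented for the sketch).
  train_sets="europarl newscommentary"
  raw_dir=/path/to/raw/data
}

task download_or_link > src trg :: sets=$train_sets dir=$raw_dir {
  # Concatenate every listed prefix into one merged parallel training set.
  for prefix in $sets; do
    cat $dir/$prefix.src >> $src
    cat $dir/$prefix.trg >> $trg
  done
}
```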
I think you are mixing two different things here: […] So basically […] Now on to the issue of the […]
Regarding dummy tasks: the line you point to doesn't crash, though you're right that it doesn't work. However, you can define variables with the graft, like this, which seems clearer to me than using dummy tasks. You can see I've done this for the sacrebleu task. So there is a workaround that doesn't require grafting on variables.
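As a concrete illustration of the graft-on-a-variable pattern (the branch point `DataSection`, the tasks `tokenize`/`decode`, and the variable `dev_ref` are invented for this sketch; the real names live in the repo):

```
global {
  # Bind one specific branch of an upstream output to a variable via a
  # branch graft, instead of routing it through a dummy merge task.
  dev_ref=$out@tokenize[DataSection:dev]
}

task score < hyp=$out@decode ref=$dev_ref > bleu {
  # e.g. score a decoded dev set against the grafted reference
  sacrebleu $ref < $hyp > $bleu
}
```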
Also, I think branch grafts are allowed on variables: jhclark/ducttape#30. I just tested the code you pointed to, and it works fine. |
This is interesting, because I do glob on variables sometimes. See this for example, which works fine.
As we discussed, it seems that it breaks when a variable is defined with nested branches.
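To make the two cases concrete (all names below are invented, the sketch assumes ducttape's `[BranchPoint:*]` branch-glob syntax, and the works/breaks annotations only restate what is reported in this thread):

```
global {
  # Glob over all branches of a branch point when defining a variable;
  # per the comment above, this kind of definition works fine.
  all_bleu=$bleu@score[TestSet:*]

  # A variable defined with nested branch points; per the discussion above,
  # this is the case where globbing on the variable breaks down.
  dev_data=(Set: dev=(Lang: de=/data/dev.de en=/data/dev.en))
}
```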
I would like to propose that we remove the dummy* tasks that allow for data to be merged. They add a lot of complexity to the task structure as well as to the test-time variables.
Instead, I would propose that data processing dependencies be changed from files (`<`) to variables (`::`), and that we then allow multiple values in each variable. If there are multiple values, the staging step will concatenate them. This retains the ability to have multiple training or dev datasets, but removes the complexity from ducttape and puts it all into the preprocessing stage.
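A rough before/after sketch of the proposal, with invented task and variable names (`preprocess.sh`, the corpus names, and the paths are placeholders), assuming standard ducttape syntax; the "after" side mirrors the download_or_link sketch above:

```
global {
  # After-style config: the dataset list as one multi-valued variable
  # (names invented for the sketch).
  train_sets="europarl newscommentary"
}

# Before (hypothetical): a dummy task exists only to merge corpora and hand a
# single file downstream through a file dependency (<).
task dummy_merge_train < a=/raw/europarl.de-en b=/raw/newscommentary.de-en > out {
  cat $a $b > $out
}
task preprocess_old < in=$out@dummy_merge_train > out {
  preprocess.sh < $in > $out
}

# After (hypothetical): the list travels as a plain parameter (::) and the
# staging task concatenates it, so the dummy task disappears.
task preprocess_new > out :: sets=$train_sets {
  for prefix in $sets; do cat /raw/$prefix.de-en; done | preprocess.sh > $out
}
```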
Thoughts?