[Question] What license is used for this FLAN dataset (not the code)? #67
Comments
@quq99 Good question. As the Flan Collection (or P3, or Natural Instructions v2) is a compilation of hundreds of different datasets, with many different licenses, the rendered data would not be under Apache 2.0. I am actually working on a full labelling of the dataset licenses and plan to release this publicly soon, so that users can take the subset of Flan that fits their licensing constraints.
@shayne-longpre Thanks a lot! Looking forward to that. When you finish, could you reply in this issue so I know? Appreciate your work!!
@quq99 Update: we plan to release this in the last week of May.
@shayne-longpre Looking forward to the license-labeled dataset. Thanks for the effort!
@shayne-longpre Any update on the license labelling above? Were you able to complete it?
@balachandarsv Apologies again for the wait on this. It turns out license labelling is much more complex than we had originally anticipated. It has grown from a side project into my next major release, with many more data selection/partitioning features being added, not just for Flan but for many relevant data sources. It's tentatively slated for mid-July. I hope this isn't too inconvenient, and apologies again for the delay.
@shayne-longpre No problem at all. Please let me know if you need help sorting the data by license. I will be happy to help! :-)
Hi @shayne-longpre, thanks for labeling all the licenses in the Flan Collection! I'm a bit confused about the Flan-T5 models' Apache-2.0 license: if some datasets in the Flan Collection have to be removed due to license constraints, how can the Flan-T5 models be under Apache-2.0? Were they trained only on permissively licensed datasets?
Any updates?
Sorry for the long delay -- I am not at Google so haven't been maintaining this. Licenses have been annotated for Flan and many more datasets here: https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection. It is not, however, legal advice -- interpreting how the licenses apply to the data is complicated and requires a lawyer. These annotations provide information that enables you to apply your own legal/ethical framework.
Hi,
Thanks a lot for open-sourcing the code to fetch the FLAN dataset.
I noticed in the paper: The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. (https://arxiv.org/abs/2301.13688) you mentioned
I noticed that this repo uses the Apache 2.0 license. Is the FLAN dataset fetched by the code also under the Apache 2.0 license?
Thanks a lot!