[Question] What license is used for this FLAN dataset (not the code)? #67

Open
quq99 opened this issue May 1, 2023 · 10 comments

@quq99

quq99 commented May 1, 2023

Hi,

Thanks a lot for open-sourcing the code to fetch the FLAN dataset.

I noticed that in the paper, The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (https://arxiv.org/abs/2301.13688), you mention:

"to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at this https URL."

I noticed that this repo uses the Apache 2.0 license. Is the FLAN dataset fetched by the code also under the Apache 2.0 license?

Thanks a lot!

@shayne-longpre
Collaborator

@quq99 Good question. As the Flan Collection (or P3, or Natural Instructions v2) is a compilation of hundreds of different datasets, with many different licenses, the rendered data would not be under Apache 2.0.

I am actually working on a full labelling of the dataset licenses and plan to release this publicly soon, so that users can take the subset of Flan that fits their licensing constraints.

@quq99
Author

quq99 commented May 2, 2023

@shayne-longpre Thanks a lot! Looking forward to that. When you finish, could you reply in this issue so that I'll know? I appreciate your work!

@shayne-longpre
Collaborator

@quq99 Update: we plan to release this in the last week of May.

@balachandarsv

@shayne-longpre Looking forward to the dataset labeled with licenses. Thanks for the effort!

@balachandarsv

@shayne-longpre Any update on the license labeling mentioned above? Were you able to complete it?

@shayne-longpre
Collaborator

@balachandarsv Apologies again for the wait on this. It turns out license labelling is much more complex than we had originally anticipated.

It has grown from a side project into my next major release, with many more data selection/partitioning features being added, not just for Flan but for many relevant data sources. It's tentatively slated for mid-July. I hope this isn't too inconvenient, and apologies again for the delay.

@balachandarsv

@shayne-longpre No problem at all. Please let me know if you need help sorting the data by license. I will be happy to help! :-)

@cchenv

cchenv commented Jun 27, 2023

Hi @shayne-longpre, thanks for labeling all the licenses in the Flan Collection! I'm a bit confused about the Flan-T5 models' Apache-2.0 license: if some datasets in the Flan Collection have to be removed due to license constraints, why can the Flan-T5 models be released under Apache-2.0? Were they trained with only permissively licensed datasets?

@muupan

muupan commented Apr 4, 2024

Any updates?

@shayne-longpre
Collaborator

Sorry for the long delay -- I am not at Google so haven't been maintaining this.

Licenses have been annotated for Flan and many more datasets here: https://github.com/Data-Provenance-Initiative/Data-Provenance-Collection. It is not, however, legal advice -- interpreting how the licenses apply to the data is complicated and requires a lawyer. These annotations provide information that can help you apply your own legal/ethical framework.


5 participants