New Data Questions #23

Open
myeomans opened this issue Oct 1, 2018 · 8 comments

@myeomans

myeomans commented Oct 1, 2018

Hello, all! I find your dataset fascinating, and I am glad you posted new chats from this summer. But I am having some trouble understanding the formatting. It has changed significantly since the first data dump, and the documentation does not address these changes. I have listed the major issues below. Can you clarify?

  1. There is no longer any context text in the new files. Was this dropped?

  2. In some (but not all) of the files there are no longer any user profiles. Was this dropped in the middle of the data collection?

  3. There is also only one evaluation metric ("eval_score"), rather than three ("breadth", "engagement", and "quality"). Was the paradigm changed from the first rounds? And what does "profile_match" measure?

  4. How are we supposed to know which participant is the bot and which is the human? Are they consistently labeled (e.g. participant1 is always human) or is there a separate key we need?

In summary, this is a fantastic resource, but I am not sure how useful it is without understanding how the data was assembled. Is there an updated data dictionary available anywhere?

@madrugado
Collaborator

Hi Michael,

Sorry for the delayed response. Here are the answers to your questions:

  1. In ConvAI2 we test whether bots can pretend to be a particular person. So instead of context, we provide participants with a persona description.
  2. Could you please elaborate on your question?
  3. For ConvAI2 we reduced the three metrics you mention to a single one describing quality, since all three were highly correlated. We also added a metric for role-playing.
  4. Thank you for pointing this out. We will decide how to add this information and let you know once it is available.

Thank you for your interest!

@myeomans
Author

Thanks for the response! We have been using your original data to replicate some of our own experimental data, so you can imagine our interest in the new data as well.

  1. great!
  2. There is a "user profile" slot and a "bot profile" in the json files for July 6-7, but not July 4-5.
  3. great!
  4. Please let us know. This is pretty critical for our replication, and for data quality. It does seem like some of the bots were a bit chaotic, so the "bot fixed effect" is useful to estimate here.
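
The profile gap described in point 2 can be checked per file with a short script. This is only a minimal sketch: the key names `user_profile` and `bot_profile` are assumptions based on the slot names mentioned above, and the records are toy stand-ins rather than real dialogs.

```python
from collections import Counter

# Hypothetical key names; the real JSON files may spell these differently.
PROFILE_KEYS = ("user_profile", "bot_profile")


def profile_coverage(dialogs):
    """Count how many dialog records carry each profile key with a non-empty value."""
    counts = Counter()
    for dialog in dialogs:
        for key in PROFILE_KEYS:
            if dialog.get(key):
                counts[key] += 1
    return {key: counts[key] for key in PROFILE_KEYS}


# Toy records standing in for two daily files (not real data).
july_4 = [{"dialog_id": 1}, {"dialog_id": 2}]
july_6 = [{"dialog_id": 3, "user_profile": ["..."], "bot_profile": ["..."]}]

for name, dialogs in (("july_4", july_4), ("july_6", july_6)):
    print(name, profile_coverage(dialogs))
```

With real data, each list would come from `json.load` on one of the daily files, assuming each file holds a list of dialog objects.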

@madrugado
Collaborator

We've merged all the data into one file and added human/bot markup. The dataset is located in the same folder: https://github.com/DeepPavlov/convai/blob/master/data/summer_wild_evaluation_dialogs.json

Please feel free to contact us again.

@myeomans
Author

Thank you, I appreciate it! I'll let you know what we find.

@myeomans
Author

Hi, I want to follow up on this thread. Our paper has received an R&R, and one of our reviewers asked a specific question related to the discussion above. Would it be possible to re-open this issue, now that the contest is over? Specifically, we would like to know which bot participated in each conversation. We don't need identifiable names. A hashed identifier for each bot would be enough, so that we can cluster our standard errors at the bot level and adjust for bot-level fixed effects. Are we right to assume that the humans are all unique as well? Please let me know if you think this data would be shareable here.

Thank you,
Mike
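
For illustration, one way such a hashed identifier could be produced without exposing bot names is a keyed hash. This is a sketch under stated assumptions, not a description of how the organizers anonymize anything: the function name `anonymize_bot`, the key, and the bot names are all placeholders.

```python
import hashlib
import hmac


def anonymize_bot(bot_name: str, secret_key: bytes) -> str:
    """Return a stable, opaque identifier for a bot name using HMAC-SHA256."""
    digest = hmac.new(secret_key, bot_name.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate for readability; 12 hex characters is an arbitrary choice.
    return "bot_" + digest[:12]


# Placeholder key and bot names, for illustration only.
key = b"keep-this-private"
ids = {name: anonymize_bot(name, key) for name in ("team_a_bot", "team_b_bot")}
print(ids)
```

The same name always maps to the same identifier, which is all that clustering needs, while the secret key keeps outsiders from recomputing the mapping from a list of known bot names.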

@madrugado
Collaborator

Hi Mike,

This additional information is available for at least one part of the dataset: http://convai.io/data/data_tolokers.json
(Anonymized user IDs are also included there.)

We'll discuss whether to make this information available for the other parts.

Best Regards,
Valentin

madrugado reopened this Mar 20, 2019

@myeomans
Author

Thank you! The anonymized user IDs in this file are exactly what we were looking for. It looks like you've done a bit of data cleaning, too, dropping broken bots (which mostly overlaps with our own cleaning).
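
With anonymized bot IDs in hand, the bot-level adjustment discussed earlier in this thread could look roughly like the sketch below. The field name `bot_id` is hypothetical, `eval_score` follows the metric name mentioned above, and the rows are toy values, not real data. Clustered standard errors themselves would come from a statistics package; this only shows the within-bot demeaning step behind a fixed-effects adjustment.

```python
from collections import defaultdict
from statistics import mean


def demean_by_bot(rows):
    """Subtract each bot's mean score from the scores of its conversations."""
    by_bot = defaultdict(list)
    for row in rows:
        by_bot[row["bot_id"]].append(row["eval_score"])
    bot_means = {bot: mean(scores) for bot, scores in by_bot.items()}
    return [
        {**row, "score_within_bot": row["eval_score"] - bot_means[row["bot_id"]]}
        for row in rows
    ]


# Toy rows, not real data.
rows = [
    {"bot_id": "Bot 001", "eval_score": 4},
    {"bot_id": "Bot 001", "eval_score": 2},
    {"bot_id": "Bot 002", "eval_score": 5},
]
adjusted = demean_by_bot(rows)
for row in adjusted:
    print(row)
```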

@nirav0999

Hi,
The bots in http://convai.io/data/data_tolokers.json are referred to by anonymized IDs such as Bot 001, Bot 002, and so on.
Could you provide a mapping from these bot IDs to the leaderboard (for example, whether the huggingface model is Bot 001)? We are looking at data from specific bot-human conversations.
