New Data Questions #23

myeomans · 2018-10-01T19:29:02Z

Hello, all! I find your dataset fascinating, and I am glad you posted new chats from this summer. But I am having some trouble understanding the formatting. It has changed significantly since the first data dump, and the documentation does not address these changes. I have listed the major issues below. Could you clarify?

1. There is no longer any context text in the new files. Was this dropped?
2. In some (but not all) of the files there are no longer any user profiles. Was this dropped in the middle of the data collection?
3. There is also only one evaluation metric ("eval_score"), rather than three ("breadth", "engagement", and "quality"). Was the paradigm changed from the first rounds? And what is "profile_match" all about?
4. How are we supposed to know which participant is the bot and which is the human? Are they consistently labeled (e.g. participant1 is always human) or is there a separate key we need?

In summary, this is a fantastic resource, but I am not sure how useful it is without understanding how the data was assembled. Is there an updated data dictionary available anywhere?

madrugado · 2018-10-11T16:04:08Z

Hi Michael,

Sorry for the delayed response. Here are the answers to your questions:

1. In ConvAI2 we test whether bots can pretend to be a particular person, so instead of a context we give participants a persona description.
2. Could you please elaborate on your question?
3. For ConvAI2 we reduced the three metrics you mention to a single one describing quality, since all three were highly correlated. We also added a metric for role-playing.
4. Thank you for pointing this out. We will decide how to add this information and let you know when it is added.

Thank you for your interest!

myeomans · 2018-10-12T01:07:19Z

Thanks for the response! We have been using your original data to replicate some of our own experimental data, so you can imagine our interest in the new data, as well.

1. Great!
2. There is a "user profile" slot and a "bot profile" in the json files for July 6-7, but not July 4-5.
3. Great!
4. Please let us know. This is pretty critical for our replication, and for data quality. It does seem like some of the bots were a bit chaotic, so the "bot fixed effect" is useful to estimate here.

madrugado · 2018-10-12T15:05:43Z

We've merged all the data into one file and added human/bot markup. The dataset is located in the same folder: https://github.com/DeepPavlov/convai/blob/master/data/summer_wild_evaluation_dialogs.json

Please feel free to contact us again.

myeomans · 2018-10-12T19:43:32Z

Thank you, I appreciate it! I'll let you know what we find.

myeomans · 2019-03-19T01:56:46Z

Hi, I want to follow up on this thread - our paper has received an R & R, and one of our reviewers asked a specific question related to the discussion above. Would it be possible to re-open this issue with you, now that the contest is over? Specifically, we would like to know which bots participated in each conversation. We don't need identifiable names - rather, we simply want a hashed identifier for each bot, so that we can cluster our standard errors at the bot level and adjust for bot-level fixed effects. Are we right to assume that the humans are all unique, as well? Please let me know if you think this data would be shareable here.

Thank you,
Mike
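
To make the request concrete, here is a minimal sketch of the kind of hashed identifier meant above. Everything in it is an assumption for illustration: the helper `anonymize_bot_id`, the `bot_name` and `dialog_id` fields, the example team names, and the salt are hypothetical and are not taken from the ConvAI data files. The idea is only that a salted hash gives each bot a stable pseudonym without revealing its name.

```python
import hashlib


def anonymize_bot_id(bot_name: str, salt: str) -> str:
    """Return a short, stable pseudonym for a bot name (hypothetical helper)."""
    digest = hashlib.sha256((salt + ":" + bot_name).encode("utf-8")).hexdigest()
    return "bot_" + digest[:12]


# Hypothetical records; the real field names in the dataset may differ.
dialogs = [
    {"dialog_id": 1, "bot_name": "team_alpha", "eval_score": 4},
    {"dialog_id": 2, "bot_name": "team_beta", "eval_score": 2},
    {"dialog_id": 3, "bot_name": "team_alpha", "eval_score": 5},
]

# Assumption: a private salt that only the data owners know, so the
# pseudonyms cannot be reversed by hashing a list of candidate names.
SALT = "keep-this-secret"

# Replace the identifiable name with the pseudonym before release.
released = [
    {
        "dialog_id": d["dialog_id"],
        "bot_id": anonymize_bot_id(d["bot_name"], SALT),
        "eval_score": d["eval_score"],
    }
    for d in dialogs
]
```

The same bot always maps to the same `bot_id`, which is all that clustering standard errors or estimating bot-level fixed effects requires.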

madrugado · 2019-03-20T08:54:16Z

Hi Mike,

At least for one part of the dataset this additional information is available: http://convai.io/data/data_tolokers.json
(Anonymized user IDs are also included there.)

We'll discuss whether we should make this information available for the other parts.

Best Regards,
Valentin

myeomans · 2019-03-25T15:21:59Z

Thank you! The anonymized user IDs in this file are exactly what we were looking for. It looks like you've done a bit of data cleaning, too, dropping broken bots (which mostly overlaps with our own cleaning).
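
For anyone attempting a similar analysis, here is a rough sketch of how anonymized bot IDs can be used to absorb a bot-level fixed effect, by subtracting each bot's mean score from its conversations. The rows and the `bot_id` field name are assumptions made up for illustration and are not the file's actual schema; `eval_score` is the metric name mentioned earlier in this thread. A full analysis would demean every regressor the same way and would also cluster standard errors by bot, which this sketch does not do.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows, one per conversation; IDs and scores are made up.
rows = [
    {"bot_id": "Bot 001", "eval_score": 4.0},
    {"bot_id": "Bot 001", "eval_score": 2.0},
    {"bot_id": "Bot 002", "eval_score": 5.0},
    {"bot_id": "Bot 002", "eval_score": 3.0},
    {"bot_id": "Bot 002", "eval_score": 1.0},
]


def demean_by_bot(rows):
    """Add each row's deviation from its bot's mean score (within transformation)."""
    scores_by_bot = defaultdict(list)
    for r in rows:
        scores_by_bot[r["bot_id"]].append(r["eval_score"])
    bot_means = {bot: mean(scores) for bot, scores in scores_by_bot.items()}
    # Copy each row and attach the within-bot deviation as a new key.
    return [
        dict(r, score_within_bot=r["eval_score"] - bot_means[r["bot_id"]])
        for r in rows
    ]


demeaned = demean_by_bot(rows)
```

After this step, differences in average quality between bots no longer drive the comparison; only variation within each bot's conversations remains.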

nirav0999 · 2021-08-18T21:16:42Z

Hi,
The bots in http://convai.io/data/data_tolokers.json are identified by anonymized IDs such as Bot 001, Bot 002, and so on.
Could you provide a mapping from these bot IDs to the leaderboard (e.g., the huggingface model is Bot 001)? We are interested in the human-bot conversations of specific bots.

jaseweston assigned varvara-l and DeepPavlovAdmin Oct 4, 2018

madrugado closed this as completed Oct 12, 2018

madrugado reopened this Mar 20, 2019
