Feature/no answer pipeline #183

Open
wants to merge 7 commits into
base: main

Conversation

viktors264

Added a new noAnswer key and updated the generic pipeline aggregation so that it also returns all responses without an answer.

@Devographics Devographics deleted a comment from netlify bot Jan 17, 2023
@Devographics Devographics deleted a comment from vercel bot Jan 17, 2023
@SachaG
Member

SachaG commented Jan 17, 2023

This is a good start! But it's missing a key feature, which is that the no_answer key should be added to buckets, not just facets.

What I mean is that currently this gives us data like this (in this case, "years of experience" with the "gender" facet):

facets: [
        {
          type: 'gender',
          id: 'noAnswer',
          buckets: [
            { id: 'range_5_10', count: 66 },
            { id: 'range_10_20', count: 49 },
            { id: 'range_less_than_1', count: 20 },
          ]
        },
        {
          type: 'gender',
          id: 'not_listed',
          buckets: [
            { id: 'range_2_5', count: 35 },
            { id: 'range_5_10', count: 34 },
            { id: 'range_10_20', count: 39 },
          ]
        },
        {
          type: 'gender',
          id: 'male',
          buckets: [
            { id: 'range_2_5', count: 7970 },
            { id: 'range_10_20', count: 5470 },
            { id: 'range_5_10', count: 7362 },
          ]
        },

So you've added the "years of experience" breakdown for people who didn't answer the "gender" question.

But within each "years of experience" array of buckets, we also want to know how many people didn't answer the years of experience question. So the data we actually want would be more like this:

facets: [
        {
          type: 'gender',
          id: 'noAnswer',
          buckets: [
            { id: 'range_5_10', count: 66 },
            { id: 'range_10_20', count: 49 },
            { id: 'range_less_than_1', count: 20 },
            { id: 'no_answer', count: 123 }, // people who answered neither "gender" nor "years of experience"
          ]
        },
        {
          type: 'gender',
          id: 'not_listed',
          buckets: [
            { id: 'range_2_5', count: 35 },
            { id: 'range_5_10', count: 34 },
            { id: 'range_10_20', count: 39 },
            { id: 'no_answer', count: 123 }, // people who picked "not_listed" as gender but didn't answer "years of experience"
          ]
        },
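The target shape above can be illustrated with a small in-memory sketch. This is not the Mongo aggregation from the PR; `crossTab`, `SurveyResponse` and the sample responses are hypothetical names and data, and the sketch assumes a single `no_answer` id is used for both facets and buckets (the PR currently uses `noAnswer` for the facet id).

```typescript
// Hypothetical types; the property names mirror the excerpts above.
type Bucket = { id: string; count: number }
type Facet = { type: string; id: string; buckets: Bucket[] }
type SurveyResponse = { [field: string]: string | null | undefined }

// Assumed spelling; the facet id in the PR output is 'noAnswer'.
const NO_ANSWER = 'no_answer'

// Count responses per (facet value, field value) pair, mapping missing
// or null values on either side to the no_answer id.
function crossTab(responses: SurveyResponse[], field: string, facetField: string): Facet[] {
    const facets: Facet[] = []
    for (const response of responses) {
        const facetId = response[facetField] ?? NO_ANSWER
        const bucketId = response[field] ?? NO_ANSWER

        let facet = facets.filter(f => f.id === facetId)[0]
        if (!facet) {
            facet = { type: facetField, id: facetId, buckets: [] }
            facets.push(facet)
        }
        let bucket = facet.buckets.filter(b => b.id === bucketId)[0]
        if (!bucket) {
            bucket = { id: bucketId, count: 0 }
            facet.buckets.push(bucket)
        }
        bucket.count += 1
    }
    return facets
}

// Made-up sample data: one full answer, one without years of experience,
// one without gender, and one with neither.
const exampleFacets = crossTab(
    [
        { gender: 'male', years_of_experience: 'range_2_5' },
        { gender: 'male' },
        { years_of_experience: 'range_5_10' },
        {}
    ],
    'years_of_experience',
    'gender'
)
```

With this sample, both the `male` facet and the `no_answer` facet end up with their own `no_answer` bucket, which is the behaviour requested above.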

Additionally we want this no_answer bucket to appear even when people don't select any facet. So we also want this:

"facets": [
              {
                "id": "default", // this is what we get when no facet is selected
                "buckets": [
                  {
                    "id": "range_less_than_1",
                    "count": 1272,
                  },
                  {
                    "id": "range_1_2",
                    "count": 4177,
                  },
                  {
                    "id": "range_2_5",
                    "count": 8710,
                  },
                  {
                    "id": "no_answer",
                    "count": 123,
                  },

@@ -9,6 +9,7 @@ age:
- range_more_than_65

years_of_experience:
- no_answer
Member

Because every field will need a no_answer key I don't think it makes sense to add it here, it's probably better to add it directly in generic.ts somewhere.

Member

I'm not sure we actually need to add it though, or maybe only at the GraphQL level… at least I don't think we use those keys in the pipeline? I'll double check.

Author

@viktors264 viktors264 Jan 18, 2023

Previously

> This is a good start! But it's missing a key feature, which is that the no_answer key should be added to buckets, not just facets.

I have updated the generic pipeline and added the "no_answer" key to the buckets section. I also tested the case where no facet is selected and got the results you described above.

Author

> Because every field will need a no_answer key I don't think it makes sense to add it here, it's probably better to add it directly in generic.ts somewhere.

Yes, I absolutely agree. If a 'no_answer' key is needed for every field, it is better to add it in generic.ts or at the GraphQL level (if the keys from keys.yml are not used in the pipeline). I checked and did not find any direct use of the keys in the Mongo aggregation, but you should double-check.

// $unwind: {
// path: `$${facetPath}`
// }
// }
Member

I guess we can safely remove this whole step then?

Author

Removed comments. Changed aggregation.

@SachaG
Member

SachaG commented Jan 18, 2023

Screen Shot 2023-01-18 at 9 08 25

By the way, that no_answer bucket already appears in the survey results, but currently it's manually calculated in the chart itself (number of total respondents - sum of respondents in the other columns). I think it would be cleaner to do it at the API level.

(Also I guess it wouldn't be too hard to do it outside the aggregation pipeline in the rest of the JS code if the pipeline can't easily do it)
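As a rough sketch of the "outside the aggregation pipeline" option, the subtraction the chart does today could be moved into a small helper in the API code. `withNoAnswerBucket` is a hypothetical name, and the total of 14282 respondents is made up so that it lines up with the example counts earlier in this thread. It assumes a single-choice question; for multiple-choice questions the bucket counts can add up to more than the number of respondents, so this subtraction would not be valid there.

```typescript
type Bucket = { id: string; count: number }

// Derive the no_answer count as (total respondents - sum of answered buckets).
// Any existing no_answer bucket is dropped first so the helper can be re-applied.
function withNoAnswerBucket(buckets: Bucket[], totalRespondents: number): Bucket[] {
    const answeredBuckets = buckets.filter(b => b.id !== 'no_answer')
    const answered = answeredBuckets.reduce((sum, b) => sum + b.count, 0)
    // Clamp at zero in case the total is smaller than the sum of the buckets.
    const missing = Math.max(totalRespondents - answered, 0)
    return answeredBuckets.concat([{ id: 'no_answer', count: missing }])
}

// Counts taken from the "default" facet example above; the total is hypothetical.
const exampleBuckets = withNoAnswerBucket(
    [
        { id: 'range_less_than_1', count: 1272 },
        { id: 'range_1_2', count: 4177 },
        { id: 'range_2_5', count: 8710 }
    ],
    14282
)
```

Here `exampleBuckets` ends with `{ id: 'no_answer', count: 123 }`, since 14282 - (1272 + 4177 + 8710) = 123.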

@viktors264
Author

> Screen Shot 2023-01-18 at 9 08 25
>
> By the way, that no_answer bucket already appears in the survey results, but currently it's manually calculated in the chart itself (number of total respondents - sum of respondents in the other columns). I think it would be cleaner to do it at the API level.
>
> (Also I guess it wouldn't be too hard to do it outside the aggregation pipeline in the rest of the JS code if the pipeline can't easily do it)

Yes, of course, it is better to do the calculation in the API.

@Devographics Devographics deleted a comment from vercel bot Jan 18, 2023
@SachaG
Member

SachaG commented Jan 18, 2023

Good progress! But now I'm running into a different issue. It doesn't work when querying for a field where people can pick multiple options at the same time.

For example with the following GraphQL query:

query raceEthnicityQuery {
    survey(survey: state_of_js) {
        demographics {
            race_ethnicity: race_ethnicity(filters: {}, options: {}) {
                keys
                year(year: 2022) {
                    year
                    completion {
                        total
                        percentage_survey
                        count
                    }
                    facets {
                        id
                        type
                        completion {
                            total
                            percentage_question
                            percentage_survey
                            count
                        }
                        buckets {
                            id
                            count
                            percentage_question
                            percentage_survey
                        }
                    }
                }
            }
            
        }
    }
}

I get this:

results: [
    {
      facets: [
        {
          type: 'default',
          id: 'default',
          buckets: [
            { id: [ 'multiracial', 'white_european' ], count: 33 },
            {
              id: [
                'black_african',
                'east_asian',
                'hispanic_latin',
                'middle_eastern',
                'multiracial',
                'native_american_islander_australian',
                'south_asian',
                'south_east_asian'
              ],
              count: 1
            },
            {
              id: [ 'multiracial', 'hispanic_latin', 'white_european' ],
              count: 2
            },
            {
              id: [ 'multiracial', 'white_european', 'middle_eastern' ],
              count: 2
            },
            { id: [ 'east_asian', 'multiracial' ], count: 1 },
            {
              id: [ 'south_east_asian', 'south_asian', 'east_asian' ],
              count: 3
            },
            {
              id: [
                'black_african',
                'east_asian',
                'hispanic_latin',
                'middle_eastern',
                'native_american_islander_australian',
                'multiracial',
                'south_asian',
                'south_east_asian',
                'white_european',
                'not_listed'
              ],
              count: 1
            },
            { id: [ 'south_east_asian' ], count: 1000 },
            { id: [ 'multiracial', 'south_east_asian' ], count: 1 },
            {
              id: [
                'east_asian',
                'native_american_islander_australian',
                'south_asian',
                'white_european'
              ],
              count: 1
            },
etc.

As you can see it's using every existing combination of answers as a unique id key instead of aggregating them. The correct output (from main branch) would be:

  results: [
    {
      facets: [
        {
          type: 'default',
          id: 'default',
          buckets: [
            { id: 'multiracial', count: 727 },
            { id: 'east_asian', count: 1710 },
            { id: 'white_european', count: 19790 },
            { id: 'middle_eastern', count: 1158 },
            { id: 'hispanic_latin', count: 2795 },
            { id: 'south_asian', count: 1731 },
            { id: 'native_american_islander_australian', count: 142 },
            { id: 'not_listed', count: 795 },
            { id: 'south_east_asian', count: 1221 },
            { id: 'black_african', count: 1074 }
          ]
        }
      ],
      year: 2022
    }
  ]
}
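To make the difference between the two outputs concrete, here is a minimal sketch of the per-option aggregation done in plain TypeScript rather than in the pipeline. `flattenBuckets` and `RawBucket` are hypothetical names; the input reuses four of the combination buckets from the broken output above, so the resulting counts are only partial sums and do not match the real totals.

```typescript
type RawBucket = { id: string | string[]; count: number }
type Bucket = { id: string; count: number }

// Credit each combination's count to every individual option it contains,
// which is what unwinding the array before grouping achieves in the pipeline.
function flattenBuckets(raw: RawBucket[]): Bucket[] {
    const byId: { [id: string]: Bucket } = {}
    const result: Bucket[] = []
    for (const rawBucket of raw) {
        const ids = Array.isArray(rawBucket.id) ? rawBucket.id : [rawBucket.id]
        for (const id of ids) {
            let target = byId[id]
            if (!target) {
                target = { id, count: 0 }
                byId[id] = target
                result.push(target)
            }
            target.count += rawBucket.count
        }
    }
    return result
}

// A subset of the combination buckets shown in the broken output.
const flattened = flattenBuckets([
    { id: ['multiracial', 'white_european'], count: 33 },
    { id: ['east_asian', 'multiracial'], count: 1 },
    { id: ['south_east_asian'], count: 1000 },
    { id: ['multiracial', 'south_east_asian'], count: 1 }
])
```

For this subset, `multiracial` ends up with 35, `white_european` with 33, `east_asian` with 1 and `south_east_asian` with 1001. Note that the sum of these counts is larger than the number of respondents, which is expected for a multiple-choice question.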

@vercel

vercel bot commented Jan 18, 2023

Someone is attempting to deploy a commit to the Devographics Team on Vercel.

A member of the Team first needs to authorize it.

@viktors264
Author

viktors264 commented Jan 18, 2023

> Good progress! But now I'm running into a different issue. It doesn't work when querying for a field where people can pick multiple options at the same time.

I have added the unwind operator back, with the option that keeps documents whose field is null or empty instead of skipping them. It seems we cannot remove the unwind operator. I tested your case and it works now; I also re-tested the previous cases locally and they seem to work.
It is difficult for me to know and test every case, so let me know if something is wrong.
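The option mentioned here is presumably MongoDB's `preserveNullAndEmptyArrays` on `$unwind`, although the PR diff is not shown in this thread. The sketch below only builds the stage objects and is not the actual code from the PR: `buildBucketStages` and the field path are hypothetical, and the real pipeline in generic.ts also has to handle facets, filters and so on.

```typescript
type Stage = { [operator: string]: unknown }

// Build the bucket-counting stages for one field (hypothetical helper).
function buildBucketStages(fieldPath: string): Stage[] {
    return [
        // Emit one document per selected option; keep documents whose field
        // is missing, null, or an empty array instead of dropping them.
        { $unwind: { path: `$${fieldPath}`, preserveNullAndEmptyArrays: true } },
        // Group by option, sending missing or null values to 'no_answer'.
        { $group: { _id: { $ifNull: [`$${fieldPath}`, 'no_answer'] }, count: { $sum: 1 } } },
        // Rename _id to id to match the bucket shape used in this thread.
        { $project: { _id: 0, id: '$_id', count: 1 } }
    ]
}

// The field path is made up for illustration.
const exampleStages = buildBucketStages('user_info.race_ethnicity.choices')
```

Because `$unwind` runs before `$group`, a respondent who picked several options is counted once per option, while a respondent with no value still produces one document that lands in the `no_answer` group.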
