Possible performance degradation for high cardinality columns in Contingency Similarity (affecting Quality Report) #589

Open
npatki opened this issue Jun 12, 2024 · 0 comments
Labels
bug Something isn't working feature:metrics Related to any of the individual metrics feature:reports Related to any of the generated reports

npatki commented Jun 12, 2024

Environment Details

  • SDMetrics version: 0.14.1

Error Description

In the Quality Report, the Column Pair Trends and Intertable Trends properties both use the ContingencySimilarity metric to compute a score.

The underlying metric may perform poorly when a column has extremely high cardinality. To score two columns A and B, the metric computes their cross-tabulation, and the size of that table grows with the cardinality of each column. For example, if column A is categorical with cardinality a and column B is categorical with cardinality b, then the contingency table contains a × b values. This can be slow when a or b is very large.
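To make the growth concrete, here is a minimal sketch that uses plain pandas `crosstab` rather than SDMetrics itself. The cardinalities, row count, and column values are made up for illustration; the point is only that the table has a × b cells even when far fewer combinations occur in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical cardinalities and row count, chosen only for illustration.
a, b, n_rows = 300, 200, 6_000

row_ids = np.arange(n_rows)
df = pd.DataFrame({
    "A": row_ids % a,  # column A takes a distinct values
    "B": row_ids % b,  # column B takes b distinct values
})

# Normalized cross-tabulation: one cell per (A value, B value) pair.
table = pd.crosstab(df["A"], df["B"], normalize=True)

n_cells = table.shape[0] * table.shape[1]
n_observed = int((table.to_numpy() > 0).sum())
print(f"rows of data: {n_rows}")
print(f"cells in table: {n_cells}")
print(f"cells with at least one observation: {n_observed}")
```

In this toy data the dense table has many more cells than there are rows, and most of them are empty, which is the situation that makes high cardinality columns expensive.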

Additional Context

We are not interested in replacing ContingencySimilarity with another metric. Rather, we should optimize its performance. Some ideas include:

  • looking at the base operations for cross tabulation and figuring out if there are any faster ones
  • taking a random subset
  • considering the top n most frequently occurring categories for the cross tabulation (where "top n" is calculated based on only the real data and the same exact set of n categories is used for the synthetic data)
  • etc.
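As a rough sketch of the "top n" idea only, the code below picks the n most frequent categories per column from the real data, applies the same set to the synthetic data, and compares the resulting pair frequencies. Everything here is hypothetical: the function names, the `__other__` placeholder, and the choice to bucket the remaining values (rather than drop them) are assumptions, and the score is assumed to be one minus the total variation distance between the two normalized tables. It is not the SDMetrics implementation.

```python
import pandas as pd

OTHER = "__other__"  # hypothetical bucket for categories outside the top n


def keep_top_n(real_col, synth_col, n):
    """Keep the n most frequent real categories; bucket everything else."""
    # Cast to str so the placeholder can share a dtype with the data.
    real_col = real_col.astype(str)
    synth_col = synth_col.astype(str)
    # The top n set comes from the real data only and is reused as-is.
    top = real_col.value_counts().index[:n]
    real_out = real_col.where(real_col.isin(top), OTHER)
    synth_out = synth_col.where(synth_col.isin(top), OTHER)
    return real_out, synth_out


def pair_frequencies(col_a, col_b):
    """Relative frequency of each observed (a, b) pair, stored sparsely."""
    pairs = pd.DataFrame({"a": col_a.to_numpy(), "b": col_b.to_numpy()})
    return pairs.value_counts(normalize=True)


def top_n_contingency_score(real, synthetic, col_a, col_b, n):
    """Assumed score: 1 - total variation distance over the reduced table."""
    real_a, synth_a = keep_top_n(real[col_a], synthetic[col_a], n)
    real_b, synth_b = keep_top_n(real[col_b], synthetic[col_b], n)
    real_freq = pair_frequencies(real_a, real_b)
    synth_freq = pair_frequencies(synth_a, synth_b)
    # Pairs missing from one side count as frequency 0 on that side.
    diff = real_freq.sub(synth_freq, fill_value=0.0)
    return 1.0 - 0.5 * float(diff.abs().sum())


# Tiny made-up example data.
real = pd.DataFrame({"A": ["x", "x", "y", "z"], "B": ["p", "p", "q", "r"]})
synthetic = pd.DataFrame({"A": ["y", "y", "y", "y"], "B": ["q", "q", "q", "q"]})
score = top_n_contingency_score(real, synthetic, "A", "B", n=1)
print(score)
```

With this reduction the table has at most (n + 1) × (n + 1) cells regardless of the original cardinalities. Whether bucketing changes the score too much relative to the full computation is exactly what would need to be checked.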

Any solution will have to be vetted to ensure that the overall quality score being returned does not differ too much from the status quo.
