Possible performance degradation for high cardinality columns in Contingency Similarity (affecting Quality Report) #589
Labels: bug, feature:metrics, feature:reports
Environment Details
Error Description
In the Quality Report, the Column Pair Trends and Intertable Trends properties both use the ContingencySimilarity metric to compute a score.
This underlying metric may perform poorly when a column has extremely high cardinality. To score two columns A and B, the metric computes their cross-tabulation, whose size depends on the cardinality of each column. For example, if column A is categorical with cardinality a and column B is also categorical with cardinality b, then the contingency table will contain a x b values. This may end up being slow if a or b is really large.
Additional Context
We are not interested in replacing ContingencySimilarity with another metric. Rather, we should optimize its performance. Some ideas include:
Any solution will have to be vetted to ensure that the overall quality score being returned does not differ too much from the status quo.
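To make the a x b growth and the vetting requirement concrete, here is a rough sketch, not the library's actual implementation. It assumes the score is one minus the total variation distance between the normalized contingency tables of the real and synthetic data. The function names, the column names, and the groupby-based variant are all hypothetical; that variant is only one possible candidate and is not taken from the ideas in this issue. It counts only the (A, B) combinations that actually occur, and is then compared against the dense baseline.

```python
import numpy as np
import pandas as pd


def contingency_similarity_dense(real, synthetic, col_a, col_b):
    """Baseline sketch: build dense a x b normalized tables and compare them."""
    real_table = pd.crosstab(real[col_a], real[col_b], normalize=True)
    synth_table = pd.crosstab(synthetic[col_a], synthetic[col_b], normalize=True)
    # Align both tables on the union of categories; absent combinations count as 0.
    real_aligned, synth_aligned = real_table.align(
        synth_table, join="outer", fill_value=0.0
    )
    tvd = 0.5 * np.abs(real_aligned.to_numpy() - synth_aligned.to_numpy()).sum()
    return 1.0 - tvd


def contingency_similarity_sparse(real, synthetic, col_a, col_b):
    """Hypothetical variant: count only the combinations that are observed."""
    real_counts = real.groupby([col_a, col_b], observed=True).size()
    synth_counts = synthetic.groupby([col_a, col_b], observed=True).size()
    real_freq = real_counts / real_counts.sum()
    synth_freq = synth_counts / synth_counts.sum()
    # Combinations missing on one side are treated as frequency 0.
    diff = real_freq.sub(synth_freq, fill_value=0.0)
    return 1.0 - 0.5 * diff.abs().sum()


# Made-up data with moderately high cardinality (assumed sizes, for illustration).
rng = np.random.default_rng(0)
real = pd.DataFrame(
    {"A": rng.integers(0, 500, 5000), "B": rng.integers(0, 400, 5000)}
)
synthetic = pd.DataFrame(
    {"A": rng.integers(0, 500, 5000), "B": rng.integers(0, 400, 5000)}
)

dense_score = contingency_similarity_dense(real, synthetic, "A", "B")
sparse_score = contingency_similarity_sparse(real, synthetic, "A", "B")

print("dense table shape:", pd.crosstab(real["A"], real["B"]).shape)
print("observed combinations:", real.groupby(["A", "B"]).size().shape[0])
print("score difference:", abs(dense_score - sparse_score))
```

The same comparison could serve as a vetting harness for any other candidate, including approximate ones: swap in the candidate function and check the score difference against an agreed tolerance across several datasets.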