Description
Had a look into the implementation -- the actual column comparison code (`def columns_equal`) seems rather inflexible/specifically built for use cases at Capital One. Here are two ideas for how to deal with the NumPy array issue:

A) Add new fixed logic for NumPy arrays: try to detect NumPy array columns by looking at the actual series values. Use `.all()` for NumPy arrays.

B) Add a new system for custom declaration of "comparators", i.e. give more flexibility to the user to configure how columns are compared. We would ship a default configuration that mimics the current behavior, and users would be free to change the configuration to their liking. This could be as simple as giving a list of comparators that are tried in order until one of them "understands" the data, i.e. the user could pass something like:

```python
columns_equal(
    ...,
    comparators=[
        FloatComparator(rtol=1e-3),
        StringComparator(case_sensitive=False),
        ArrayComparator(aggregate="all"),  # calls .all()
    ],
)
```

Or it could be an explicit list of comparators for each column, or something similar.
Originally posted by @jonashaag in #58
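
To make option B a bit more concrete, here is a rough sketch of the "try comparators in order" idea. It is only an illustration, not datacompy's actual code: the class names come from the quoted proposal, but the `compare` method, the "return `None` when the data is not understood" convention, and the exact-equality fallback are all assumptions. Plain Python lists stand in for pandas Series and NumPy arrays to keep it dependency-free, and the `columns_equal` below is a standalone toy, not the library function.

```python
import math


class FloatComparator:
    """Hypothetical comparator for float columns (relative tolerance)."""

    def __init__(self, rtol=1e-5):
        self.rtol = rtol

    def compare(self, col1, col2):
        if not all(isinstance(v, float) for v in (*col1, *col2)):
            return None  # not float data; let the next comparator try
        return [math.isclose(a, b, rel_tol=self.rtol) for a, b in zip(col1, col2)]


class StringComparator:
    """Hypothetical comparator for string columns."""

    def __init__(self, case_sensitive=True):
        self.case_sensitive = case_sensitive

    def compare(self, col1, col2):
        if not all(isinstance(v, str) for v in (*col1, *col2)):
            return None
        if self.case_sensitive:
            return [a == b for a, b in zip(col1, col2)]
        return [a.casefold() == b.casefold() for a, b in zip(col1, col2)]


class ArrayComparator:
    """Hypothetical comparator for columns whose cells are sequences."""

    def __init__(self, aggregate="all"):
        # Assumption: only "all" and "any" are supported aggregations.
        self._aggregate = {"all": all, "any": any}[aggregate]

    def compare(self, col1, col2):
        if not all(isinstance(v, (list, tuple)) for v in (*col1, *col2)):
            return None
        return [
            len(a) == len(b) and self._aggregate(x == y for x, y in zip(a, b))
            for a, b in zip(col1, col2)
        ]


def columns_equal(col1, col2, comparators):
    """Return one bool per row, using the first comparator that understands the data."""
    for comparator in comparators:
        result = comparator.compare(col1, col2)
        if result is not None:
            return result
    # Assumed fallback: exact equality when no comparator claims the column.
    return [a == b for a, b in zip(col1, col2)]


comparators = [
    FloatComparator(rtol=1e-3),
    StringComparator(case_sensitive=False),
    ArrayComparator(aggregate="all"),
]
print(columns_equal([1.0, 2.0], [1.0005, 2.1], comparators))        # [True, False]
print(columns_equal(["Foo", "bar"], ["foo", "baz"], comparators))   # [True, False]
print(columns_equal([[1, 2], [3, 4]], [[1, 2], [3, 5]], comparators))  # [True, False]
```

With this shape, a user-defined comparator would only need a `compare` method that returns `None` for data it does not handle, and the order of the list decides precedence.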
This is an old issue from a while back, but I want to revisit it and get some thoughts from folks on whether this is something we should look into and/or pursue. Compartmentalizing "comparators" feels cleaner from a design perspective. It would also allow people to build their own custom comparators and tweak them to their liking.
Tagging the @capitalone/datacompy-write-team for their thoughts and opinions on this.
- The initial set could use dispatching to house logic for all supported data types within, say, `FloatComparator`.
- Some builtins could be:
  - Numeric
  - String
  - Date/String
  - Temporal
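
One possible reading of the dispatching idea, sketched with `functools.singledispatchmethod`: a single `FloatComparator` keeps one implementation per supported column type. This is an assumption about what "dispatching" would look like, not a settled design. To stay runnable without third-party packages, `list` and `array.array` act as stand-ins for the real backends (for example a pandas Series and another dataframe library's column), and the `compare` method name is hypothetical.

```python
import math
from array import array
from functools import singledispatchmethod


class FloatComparator:
    """Hypothetical comparator: one class, one implementation per column type."""

    def __init__(self, rtol=1e-5, atol=1e-8):
        self.rtol = rtol
        self.atol = atol

    def _close(self, a, b):
        return math.isclose(a, b, rel_tol=self.rtol, abs_tol=self.atol)

    @singledispatchmethod
    def compare(self, col1, col2):
        # Fallback for column types this comparator has no implementation for.
        raise NotImplementedError(f"unsupported column type: {type(col1).__name__}")

    @compare.register(list)
    def _(self, col1, col2):
        # Stand-in for one backend (e.g. a pandas Series in a real implementation).
        return [self._close(a, b) for a, b in zip(col1, col2)]

    @compare.register(array)
    def _(self, col1, col2):
        # Stand-in for a second backend; same per-row result shape.
        return [self._close(a, b) for a, b in zip(col1, col2)]


fc = FloatComparator(rtol=1e-3)
print(fc.compare([1.0, 2.0], [1.0005, 2.5]))            # [True, False]
print(fc.compare(array("d", [1.0]), array("d", [1.0])))  # [True]
```

Dispatch happens on the type of the first column only, so a real version would still need to decide what to do when the two columns come from different backends.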