Consider frequency when ordering categories in LabelEncoder
#611
Labels: feature request, LabelEncoder
Problem Description
If I have unordered categorical data (aka nominal data), then it doesn't theoretically matter how the LabelEncoder decides to order the categories.
However, in practice, certain orders are better than others. In particular, an ascending-then-descending pattern of frequency makes the encoded data more closely resemble a bell curve, which can be useful for downstream modeling.
Expected behavior
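Add another option for the `order_by` parameter called `'frequency_inverted_v'` (name TBD). When set, the transformer should order the categories by frequency in an inverted-V pattern, as described under Additional context.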
Additional context
Empirically, this seems to produce drastically better results than the default.
Default ordering: Order is assigned first-come, first-served.
V-shaped ordering: Order is assigned by frequency, in an inverted V shape to resemble a bell-shaped distribution.
One way to accomplish this is by sorting the categories by frequency and then assigning them in an alternating fashion from the middle out.
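A minimal sketch of that middle-out assignment, in plain Python. The function names `inverted_v_order` and `inverted_v_codes` are hypothetical and not part of any existing API; the tie-breaking rule (first appearance wins) is also an assumption, since the request does not specify one.

```python
from collections import Counter


def inverted_v_order(values):
    """Return the distinct categories ordered so frequencies rise, then fall.

    Hypothetical helper for illustration only.
    """
    # Rank categories from most to least frequent. Counter.most_common()
    # keeps first-seen order among ties (assumed tie-breaking rule).
    ranked = [category for category, _ in Counter(values).most_common()]

    left, right = [], []
    for rank, category in enumerate(ranked):
        # Alternate sides, working outward from the middle:
        # even ranks go to the right half, odd ranks to the left half.
        if rank % 2 == 0:
            right.append(category)
        else:
            left.append(category)

    # The left half was built from the middle outward, so reverse it.
    return left[::-1] + right


def inverted_v_codes(values):
    """Map each category to its integer position in the inverted-V order."""
    return {category: code for code, category in enumerate(inverted_v_order(values))}


values = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
order = inverted_v_order(values)
codes = inverted_v_codes(values)
```

With the sample data, the most frequent category lands in the middle and the rarest ones at the two ends, so the per-code counts are 2, 4, 5, 3, 1.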