Add label encoder #5067
Conversation
transformed_vec.begin(), transformed_vec.end(),
transformed_vec.begin(), [](float64_t e) {
    if (std::abs(e - 0.0) <=
        std::numeric_limits<float64_t>::epsilon())
Math::fequals does this
plz dont change it to Math:: ;) we wanna kill that thing sooner or later
hmm I guess we need to replace this with a utility function somewhere then? Because this is quite a common operation
could be an issue, and then while we are at it can also replace all the CMath::fequals calls
I think we should either introduce this utility function now, or just use CMath::fequals and include this in a larger refactor (it is copy paste)
    "BinaryLabel should contain only two elements");

return SGVector<float64_t>(
    unique_labels.begin(), unique_labels.end());
isn't this the result of fit_impl?
TEST(BinaryLabelEncoder, fit_transform)
{
    auto label_encoder = std::make_shared<BinaryLabelEncoder>();
    SGVector<int32_t> vec{-1, -1, 1, -1, 1};
what about something with values that are not {-1,1}. Could you also test for something with more than two labels to make sure the exception is thrown?
if values are not {-1, 1}, they will be transformed to {-1, 1}; I will add more unit tests.
I guess you want to test
- binary labels are passed (no-op)
- labels with two unique values are passed (and transformed into binary)
- labels with fewer or more than two unique values are passed (error)
TEST(MulticlassLabelsEncoder, fit_transform)
{
    auto label_encoder = std::make_shared<MulticlassLabelsEncoder>();
    SGVector<float64_t> vec{1, 2, 2, 6};
I am assuming something with negative values and 0 would also work?
auto inv_result = label_encoder->inverse_transform(result_labels)
                      ->as<MulticlassLabels>()
                      ->get_labels();
could you also test for something that isn't the result? For example something like {2,0,1,0}?
Also what happens when you transform with something that wasn't fitted? For example if there is now a label 3, but your label space is {1,2,6}?
Good start! I have some questions. But I think I just misunderstood the design. It would be good to add more tests and think of all possible corner cases
src/shogun/labels/LabelEncoder.h (Outdated)
std::transform(
    result_vector.begin(), result_vector.end(),
    original_vector.begin(),
    [& normalized_to_origin = normalized_to_origin](const auto& e) {
btw if you don't rename a variable you can just write it as [&normalized_to_origin]
normalized_to_origin is a class member, so if I want to use normalized_to_origin in the lambda, I guess I have to write [&normalized_to_origin = normalized_to_origin]?
ah yes, you're right!
require(
    std::set<float64_t>(result_vector.begin(), result_vector.end())
            .size() == 2,
    "BinaryLabel should contain only two elements");
could you pls re-use the code we have for these transformations? I put some links in the issue.
Or at least then delete the old code
require(
    std::set<float64_t>(result_vector.begin(), result_vector.end())
            .size() == 2,
    "BinaryLabel should contain only two elements");
"BinaryLabel" -> "Binary labels".
And also please print what the values are so that the user can see what is wrong. See also the existing code for that
std::set<float64_t>(
    normalized_vector.begin(), normalized_vector.end())
        .size() == 2,
"BinaryLabel should contain only two elements");
See comment above
auto result_labels = fit_impl(result_vector);
require(
    unique_labels.size() == 2,
    "Binary Labels should contain only two elements");
nitpick: "Binary labels" (the idea is to not replicate class/variable/type names in user facing error messages).
Still would be good to print the unique labels in this error message
but the number of unique labels may exceed two, e.g. if {0,1,2,3,4,5,6,7,8} is passed to the method. Should we print all values?
yeah the idea is that the user gets feedback on what the problem with his input is, i.e. you print out that we've detected {0,1,2} as unique labels in his data set... this should help one to figure out/fix the input...
++ what viktor says here.
Also keep in mind that the number of unique labels will be relatively small (unless someone passes regression labels...), so maybe you could cap them at length 10 or so and add a "...", although I don't really think that is necessary.
std::unordered_set<float64_t>(
    normalized_vector.begin(), normalized_vector.end())
        .size() == 2,
"Binary Labels should contain only two elements");
could we maybe put this string somewhere so it is not replicated so many times.... as well as the check code itself?
auto inv_test = label_encoder->inverse_transform(test_labels)
                    ->as<BinaryLabels>()
                    ->get_labels();
EXPECT_EQ(-100, inv_test[0]);
could you use a loop (or macro) and the original vector to make this less verbose?
BTW I think there should be a warning printed if a user passes labels that are not integers. E.g. {2.13434, 3.42342343} shouldn't be mapped to {-1,+1} without a warning.
but it seems like the
But you can cast the elements to int and then compare the difference between the element and the integer cast, right? If that difference is larger than epsilon then there is an issue, i.e. it does not represent an integer. You could do this when you populate the set. It will slow down the encoding fitting, but this shouldn't be the performance critical part anyway.
unique_set.size() == 2,
"Binary labels should contain only two elements, ({}) have "
"been detected.",
fmt::join(unique_set, ", "));
neat! Didn't know this was a thing :D
nit: "Cannot interpret {} as binary labels"
typo can not -> cannot
[](auto&& e1, auto&& e2) {
    return std::abs(e1 - e2) >
           std::numeric_limits<float64_t>::epsilon();
});
I think you need to rename this function, it is a bit confusing what you get back. Imo it should be can_convert_float_to_int, and then you just have to invert the logic to std::abs(e1 - e2) < std::numeric_limits<float64_t>::epsilon(), and then fix the logic in each call.
{
    auto label_encoder = std::make_shared<BinaryLabelEncoder>();
    SGVector<int32_t> vec{-100, 200, -100, 200, -100, 42};
    auto origin_labels = std::make_shared<BinaryLabels>(vec);
@karlnapf doesn't BinaryLabels throw in this situation? And if not shouldn't it?
Also shouldn't this test be done with MulticlassLabels?
I think it doesn't atm, but I am not sure.
auto label_encoder = std::make_shared<MulticlassLabelsEncoder>();
SGVector<float64_t> vec{-100.1, 200.4, -2.868, 6.98, -2.868};
auto origin_labels = std::make_shared<MulticlassLabels>(vec);
auto unique_vec = label_encoder->fit(origin_labels);
so will this give the user a warning? Because they are using non-integer representations?
I think it should (though maybe it should be possible to de-activate it? not sure if this can be spammy)
I assume this would only happen once, when you fit, though, no? And the user can also change the io level.
in grid search it would happen quite a few times. But yes with the loglevel that is best taken care of
src/shogun/labels/LabelEncoder.h (Outdated)
    return "LabelEncoder";
}

void set_print_warning(bool print_warning){
Imo this shouldn't be needed. The user should set this with the log level, e.g. switch from warn to error.
++
std::transform(
    normalized_vector.begin(), normalized_vector.end(),
    normalized_vector.begin(), [](float64_t e) {
        if (std::abs(e + 1.0) <=
Math::fequals or utility func
    [](auto&& e) { return static_cast<int32_t>(e); });
return std::make_shared<BinaryLabels>(result_vev);
}
/** Fit label encoder and return encoded labels.
base class
    std::unordered_set<float64_t>(vec.begin(), vec.end());
require(
    unique_set.size() == 2,
    "Binary labels should contain only two elements, can not interpret ({}) as binary labels",
Suggested change:
- "Binary labels should contain only two elements, can not interpret ({}) as binary labels",
+ "Cannot interpret ({}) as binary labels, need exactly two classes.",
const auto unique_set =
    std::unordered_set<float64_t>(vec.begin(), vec.end());
require(
    unique_set.size() == 2,
@gf712 I think it might not be good to assert this as sometimes labels might only contain one class. But I guess this will pop up if a problem and we can change it then :)
hmm, in what situation would there only be one label?
When predicting, although I am not sure this is ever called in that case.
seems like this is only called in fit and transform, so should be fine
yep + 1 and if it becomes a problem we just change it
 * @return original encoding labels
 */
std::shared_ptr<Labels>
inverse_transform(const std::shared_ptr<Labels>& labs) override
don't we also need a check valid here? Something that ensures that the labels are contiguous? [0,1,2,3,4, ... ] no gaps.
I wrote some code for this in the links I posted. Either re-use or remove my code :)
I am wondering whether we need a validity check here, since inverse_transform maps from the internal encoding back to the original encoding. For example {100, 100, 200, 300} -> {0, 0, 1, 2}: the {0, 0, 1, 2} comes from the internal encoding, but the original labels are not contiguous.
ah you are right of course :)
namespace shogun
{

class MulticlassLabelsEncoder : public LabelEncoder
same comments as for binary labels class
src/shogun/labels/LabelEncoder.h (Outdated)
return std::equal(
    vec.begin(), vec.end(), converted.begin(),
    [](auto&& e1, auto&& e2) {
        return std::abs(e1 - e2) <
CMath::fequals or utility pls :)
@LiuYuHui could you do this change so that when CMath::fequals is replaced we can just find and replace these things?
src/shogun/labels/LabelEncoder.h (Outdated)
}

std::set<float64_t> unique_labels;
std::unordered_map<float64_t, float64_t> normalized_to_origin;
inverse_mapping
Made some more comments. Nice work :)
@@ -16,37 +16,33 @@
#include <unordered_set>
namespace shogun
{

/** @brief Implements a reversible mapping from
whitespaces are weird here. Could you clean up?
src/shogun/labels/LabelEncoder.h (Outdated)
/** @brief Implements a reversible mapping from any
 * form of labels to one of Shogun's target label spaces
 * (binary, multi-class, etc).
 *
nit: remove this line
Cool, looks good. What is missing?
I think all things have been done.
    return original_vector;
}

bool can_convert_float_to_int(const SGVector<float64_t>& vec) const
I wonder whether this should be templated (for both float and int type) and live somewhere where other conversion tools (safe_convert) live... this might be useful elsewhere.
@gf712 thoughts?
Yes, ideally this would be templated
namespace shogun
{
/** @brief Implements a reversible mapping from
some whitespace glitches
Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format? (i.e. -1,1 or 0,1,2,3,4 ....,num_classes-1)? Or is that a pointless optimization as not making a difference anyways?
I think we should not add this optimization when {-1, 1} or {0, 1, 2, ..., num_classes-1} are passed in.
Well, that could be dealt with... say if no inverse mapping was computed, the labels are not mapped backwards, but simply used as returned by the machine. @gf712 @vigsterkr what are your thoughts?
Otherwise I think we can merge this. Any objections (apart from the question above)?
Yes, makes sense to me to have this, but then you need to check if the encoder has been fitted (boolean class member) and then you can see if the map is empty -> if empty, no-op.
BTW once the above discussion is resolved, we could merge this.
And then another PR for using all this within xvalidation
src/shogun/labels/LabelEncoder.h (Outdated)
std::set<float64_t> unique_labels;
std::unordered_map<float64_t, float64_t> inverse_mapping;
const float64_t eps = std::numeric_limits<float64_t>::epsilon();
Suggested change:
- const float64_t eps = std::numeric_limits<float64_t>::epsilon();
+ constexpr float64_t eps = std::numeric_limits<float64_t>::epsilon();
when I changed const to constexpr, I got an error: non-static data member ‘eps’ declared ‘constexpr’
ah yes, it should be static constexpr float64_t eps = std::numeric_limits<float64_t>::epsilon();
I think this can be merged now :)
Great! :) As a next step, can I suggest a PR that
Related to #5054
Add LabelEncoder/MulticlassLabelsEncoder/BinaryLabelEncoder