Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add label encoder #5067

Merged
merged 8 commits into from
Jul 1, 2020
Merged

Add label encoder #5067

merged 8 commits into from
Jul 1, 2020

Conversation

LiuYuHui
Copy link
Contributor

Related to #5054
Add LabelEncoder/MulticlassLabelsEncoder/BinaryLabelEncoder

transformed_vec.begin(), transformed_vec.end(),
transformed_vec.begin(), [](float64_t e) {
if (std::abs(e - 0.0) <=
std::numeric_limits<float64_t>::epsilon())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Math::fequals does this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz dont change it to Math:: ;) we wanna kill that thing sooner or later

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I guess we need to replace this with a utility function somewhere then? Because this is quite a common operation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be an issue, and then while we are at it can also replace all the CMath::fequals calls

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should either introduce this utility function now, or just use CMath::fequals and include this in a larger refactor (it is copy paste)

"BinaryLabel should contain only two elements");

return SGVector<float64_t>(
unique_labels.begin(), unique_labels.end());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this the result of fit_impl?

TEST(BinaryLabelEncoder, fit_transform)
{
auto label_encoder = std::make_shared<BinaryLabelEncoder>();
SGVector<int32_t> vec{-1, -1, 1, -1, 1};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about something with values that are not {-1,1}. Could you also test for something with more than two labels to make sure the exception is thrown?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if values are not {-1, 1}, they will be transformed to {-1, 1}, i will add more unite test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you want to test

  • binary labels are passed (no-op)
  • labels with two unique values are passed (and transformed into binary)
  • labels with fewer or more than two unique values are passed (error)

TEST(MulticlassLabelsEncoder, fit_transform)
{
auto label_encoder = std::make_shared<MulticlassLabelsEncoder>();
SGVector<float64_t> vec{1, 2, 2, 6};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming something with negative values and 0 would also work?


auto inv_result = label_encoder->inverse_transform(result_labels)
->as<MulticlassLabels>()
->get_labels();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you also test for something that isn't the result? For example something like {2,0,1,0}?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also what happens when you transform with something that wasn't fitted? For example if now there was a labels 3, but your labels space is {1,2,6}?

Copy link
Member

@gf712 gf712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start! I have some questions. But I think I just misunderstood the design. It would be good to add more tests and think of all possible corner cases

std::transform(
result_vector.begin(), result_vector.end(),
original_vector.begin(),
[& normalized_to_origin = normalized_to_origin](const auto& e) {
Copy link
Member

@gf712 gf712 Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw if you don't rename a variable you can just write it as [&normalized_to_origin]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalized_to_origin is a class member, so if I want to use normalized_to_origin in lambda, I guess I have to write [& normalized_to_origin = normalized_to_origin]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, you're right!

require(
std::set<float64_t>(result_vector.begin(), result_vector.end())
.size() == 2,
"BinaryLabel should contain only two elements");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you pls re-use the code we have for these transformations? I put some links in the issue.
Or at least then delete the old code

require(
std::set<float64_t>(result_vector.begin(), result_vector.end())
.size() == 2,
"BinaryLabel should contain only two elements");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"BinaryLabel" -> "Binary labels".
And also please print what the values are so that the user can see what is wrong. See also the existing code for that

std::set<float64_t>(
normalized_vector.begin(), normalized_vector.end())
.size() == 2,
"BinaryLabel should contain only two elements");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

auto result_labels = fit_impl(result_vector);
require(
unique_labels.size() == 2,
"Binary Labels should contain only two elements");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: "Binary labels" (the idea is to not replicate class/variable/type names in user facing error messages).
Still would be good to print the unique labels in this error message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the amount of the unique labels may be exceed more than two, such as if {0,1,2,3,4,5,6,7,8} is passed to the method, should we print all values?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah the idea is that the user gets a feedback what is the the problem with his input. i.e. you print out that we've detected {0,1,2} as unique labels in his data set... this should help one to figure out/fix the input...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ what viktor says here.
Also keep in mind that the unique number of labels will be relatively small (unless someone passes regression labels .... so maybe you could cap them at length 10 or so and make a "...", although I don't really think that is necessary)

std::unordered_set<float64_t>(
normalized_vector.begin(), normalized_vector.end())
.size() == 2,
"Binary Labels should contain only two elements");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we maybe put this string somewhere so it is not replicated so many times.... as well as the check code itself?

auto inv_test = label_encoder->inverse_transform(test_labels)
->as<BinaryLabels>()
->get_labels();
EXPECT_EQ(-100, inv_test[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you use a loop (or macro) and the original vector so make this less verbose?

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I think there should be a warning printed if a user passes labels that are not integers. E.g. {2.13434, 3.42342343} shouldnt be mapped to {-1,+1} without warning

@LiuYuHui
Copy link
Contributor Author

BTW I think there should be a warning printed if a user passes labels that are not integers. E.g. {2.13434, 3.42342343} shouldnt be mapped to {-1,+1} without warning

but it seems like the Labels class use SGVector<float64_t> to store the values, even the values are SGVector<int32_t>, so i think there is no way to figure out what the values are, because those values have been converted to float64_t.

@gf712
Copy link
Member

gf712 commented Jun 16, 2020

BTW I think there should be a warning printed if a user passes labels that are not integers. E.g. {2.13434, 3.42342343} shouldnt be mapped to {-1,+1} without warning

but it seems like the Labels class use SGVector<float64_t> to store the values, even the values are SGVector<int32_t>, so i think there is no way to figure out what the values are, because those values have been converted to float64_t.

But you can cast the elements to int and then compare the difference between element and integer cast right? If that difference is larger than epsilon then there is an issue, i.e. it does not represent an integer. You could do this when you populate the set. It will slow down the encoding fitting massively, but this shouldn't be the performance critical part anyway

unique_set.size() == 2,
"Binary labels should contain only two elements, ({}) have "
"been detected.",
fmt::join(unique_set, ", "));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat! Didn't know this was a thing :D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Cannot interpret {} as binary labels"

Copy link
Member

@karlnapf karlnapf Jun 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo can not -> cannot

[](auto&& e1, auto&& e2) {
return std::abs(e1 - e2) >
std::numeric_limits<float64_t>::epsilon();
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to rename this function, it is a bit confusing what you get back. Imo it should be can_convert_float_to_int and then you just have to invert the logic to std::abs(e1 - e2) < std::numeric_limits<float64_t>::epsilon();, and then fix the logic in each call

{
auto label_encoder = std::make_shared<BinaryLabelEncoder>();
SGVector<int32_t> vec{-100, 200, -100, 200, -100, 42};
auto origin_labels = std::make_shared<BinaryLabels>(vec);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlnapf doesn't BinaryLabels throw in this situation? And if not shouldn't it?
Also shouldn't this test be done with MulticlassLabels?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesnt atm, but I am not sure.

auto label_encoder = std::make_shared<MulticlassLabelsEncoder>();
SGVector<float64_t> vec{-100.1, 200.4, -2.868, 6.98, -2.868};
auto origin_labels = std::make_shared<MulticlassLabels>(vec);
auto unique_vec = label_encoder->fit(origin_labels);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so will this give the user a warning? Because they are using non-integer representations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should (though maybe it should be possible to de-activate it? not sure if this can be spamy)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this would only happen once when you fit thought no? And the user can also change the io level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in grid search it would happen quite a few times. But yes with the loglevel that is best taken care of

return "LabelEncoder";
}

void set_print_warning(bool print_warning){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imo this shouldn’t be needed. The user should set this with log level, eg switch from warn to error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

std::transform(
normalized_vector.begin(), normalized_vector.end(),
normalized_vector.begin(), [](float64_t e) {
if (std::abs(e + 1.0) <=
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Math::fequals or utility func

[](auto&& e) { return static_cast<int32_t>(e); });
return std::make_shared<BinaryLabels>(result_vev);
}
/** Fit label encoder and return encoded labels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base class

std::unordered_set<float64_t>(vec.begin(), vec.end());
require(
unique_set.size() == 2,
"Binary labels should contain only two elements, can not interpret ({}) as binary labels",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Binary labels should contain only two elements, can not interpret ({}) as binary labels",
"Cannot interpret ({}) as binary labels, need exactly two classes.",

const auto unique_set =
std::unordered_set<float64_t>(vec.begin(), vec.end());
require(
unique_set.size() == 2,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gf712 I think it might not be good to assert this as sometimes labels might only contain one class. But I guess this will pop up if a problem and we can change it then :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, in what situation would there only be one label?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When predicting, although I am not sure this is ever called in that case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this is only called in fit and transform, so should be fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep + 1 and if it becomes a problem we just change it

* @return original encoding labels
*/
std::shared_ptr<Labels>
inverse_transform(const std::shared_ptr<Labels>& labs) override
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we also need a check valid here? Something that ensures that the labels are contiguous? [0,1,2,3,4, ... ] no gaps.
I wrote some code for this in the links I posted. Either re-use or remove my code :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking whether we need a check valid here, as inverse_transform is to map from internal encoding to origin encoding. for example, {100, 100, 200, 300} -> {0, 0, 1, 2}, {0, 0, 1, 2} are transformed by internal encoding, but it is not continuous

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah you are right of course :)

namespace shogun
{

class MulticlassLabelsEncoder : public LabelEncoder
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comments as for binary labels class

return std::equal(
vec.begin(), vec.end(), converted.begin(),
[](auto&& e1, auto&& e2) {
return std::abs(e1 - e2) <
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMath::fequals or utility pls :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LiuYuHui could you do this change so that when CMath::fequals is replaced we can just find and replace these things?

}

std::set<float64_t> unique_labels;
std::unordered_map<float64_t, float64_t> normalized_to_origin;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inverse_mapping

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some more comments. Nice work :)

@@ -16,37 +16,33 @@
#include <unordered_set>
namespace shogun
{

/** @brief Implements a reversible mapping from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespaces are weird here. Could you clean up?

/** @brief Implements a reversible mapping from any
* form of labels to one of Shogun's target label spaces
* (binary, multi-class, etc).
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove this line

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, looks good. What is missing?

@LiuYuHui
Copy link
Contributor Author

Cool, looks good. What is missing?

i think all things have been done.

return original_vector;
}

bool can_convert_float_to_int(const SGVector<float64_t>& vec) const
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether this should be templated (for both float and int type) and live somewhere where other conversion tools (safe_convert) live...this might be useful elesewhere
@gf712 thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ideally this would be templated


namespace shogun
{
/** @brief Implements a reversible mapping from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some whitespace glitches

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format? (i.e. -1,1 or 0,1,2,3,4 ....,num_classes-1)? Or is that a pointless optimization as not making a difference anyways?

@LiuYuHui
Copy link
Contributor Author

Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format? (i.e. -1,1 or 0,1,2,3,4 ....,num_classes-1)? Or is that a pointless optimization as not making a difference anyways?

I think we should not add this optimization, when {-1, 1} or {0, 1, 2, num_classes-1} are passed in fit/transform, we still need to maintain the mapping, as we don't know what values will be passed in inverse_transform.

@karlnapf
Copy link
Member

Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format? (i.e. -1,1 or 0,1,2,3,4 ....,num_classes-1)? Or is that a pointless optimization as not making a difference anyways?

I think we should not add this optimization, when {-1, 1} or {0, 1, 2, num_classes-1} are passed in fit/transform, we still need to maintain the mapping, as we don't know what values will be passed in inverse_transform.

Well that could be dealt with ... say if no inverse mapping was computed, the labels are not mapped backwards, but simply used as returned by the machine. @gf712 @vigsterkr what are your thoughts?

@karlnapf
Copy link
Member

I think otherwise, we can merge this. Any objections (apart from the question above)?

@gf712
Copy link
Member

gf712 commented Jun 28, 2020

Just wondering: would it make sense to avoid doing any conversion if the labels are already in the correct format? (i.e. -1,1 or 0,1,2,3,4 ....,num_classes-1)? Or is that a pointless optimization as not making a difference anyways?

I think we should not add this optimization, when {-1, 1} or {0, 1, 2, num_classes-1} are passed in fit/transform, we still need to maintain the mapping, as we don't know what values will be passed in inverse_transform.

Well that could be dealt with ... say if no inverse mapping was computed, the labels are not mapped backwards, but simply used as returned by the machine. @gf712 @vigsterkr what are your thoughts?

Yes, makes sense to me to have this, but then you need to check if the encoder has been fitted (boolean class member) and then you can see if the map is empty -> if empty noop

@karlnapf
Copy link
Member

BTW once the above discussion is resolved, we could merge this.
But there is a follow up thing to do (could go into a single PR)

  • embed this into Machine
  • remove the old conversion methods
  • all machine implementations can then perform dynamic casts and assume the correct label type is given to them

And then another PR for using all this within xvalidation


std::set<float64_t> unique_labels;
std::unordered_map<float64_t, float64_t> inverse_mapping;
const float64_t eps = std::numeric_limits<float64_t>::epsilon();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const float64_t eps = std::numeric_limits<float64_t>::epsilon();
constexpr float64_t eps = std::numeric_limits<float64_t>::epsilon();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I changed const to constexpr, I got an error: non-static data member ‘eps’ declared ‘constexpr’

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, it should be static constexpr float64_t eps = std::numeric_limits<float64_t>::epsilon();

@gf712
Copy link
Member

gf712 commented Jul 1, 2020

I think this can merged now :)

@karlnapf karlnapf merged commit 1f1f7d8 into shogun-toolbox:develop Jul 1, 2020
@karlnapf
Copy link
Member

karlnapf commented Jul 1, 2020

Great! :)

As a next step, can I suggest a PR that

  • uses this class inside Machine::train
  • removes all the conversion code in BinaryLabels and MulticlassLabels
  • replaces all usages of binary_labels and multiclass_labels conversion methods in train_machine methods with ->as<BinaryLabels>() etc? I.e. a simple dynamic cast
  • Fixes all the examples/tests that will break through that :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants