Add label encoder #5067

Merged (8 commits) on Jul 1, 2020
Changes from 3 commits
142 changes: 142 additions & 0 deletions src/shogun/labels/BinaryLabelEncoder.h
@@ -0,0 +1,142 @@
/*
* This software is distributed under BSD 3-clause license (see LICENSE file).
*
* Authors: Yuhui Liu
*
*/
#ifndef _BINARYLABELENCODER__H__
#define _BINARYLABELENCODER__H__

#include <memory>
#include <shogun/base/SGObject.h>
#include <shogun/labels/BinaryLabels.h>
#include <shogun/labels/DenseLabels.h>
#include <shogun/labels/LabelEncoder.h>
#include <shogun/lib/SGVector.h>
#include <unordered_set>
namespace shogun
{

class BinaryLabelEncoder : public LabelEncoder
{
public:
BinaryLabelEncoder() = default;

~BinaryLabelEncoder() = default;

/** Fit label encoder
Review comment (Member): Could we move this docstring to the base class and just leave it empty here (as it contains nothing relevant for this subclass)?
*
* @param Target values.
* @return SGVector which contains unique labels.
*/
SGVector<float64_t> fit(const std::shared_ptr<Labels>& labs) override
{
const auto result_vector = labs->as<DenseLabels>()->get_labels();
check_is_valid(result_vector);
if (is_convert_float_to_int(result_vector))
{
io::warn(
"({}, {}) have been converted to (-1, 1).",
Review comment (Member): Is that always true? Wouldn't you want to print the actual conversion? I.e. is result_vector[0] always mapped to -1?
result_vector[0], result_vector[1]);
}
auto result_labels = fit_impl(result_vector);
return result_labels;
}
/** Transform labels to normalized encoding.
Review comment (Member): same here, move to the base class and remove here.
*
* @param Target values to be transformed.
* @return Labels transformed to be normalized.
*/
std::shared_ptr<Labels>
transform(const std::shared_ptr<Labels>& labs) override
{
const auto result_vector = labs->as<DenseLabels>()->get_labels();
check_is_valid(result_vector);
auto transformed_vec = transform_impl(result_vector);

std::transform(
transformed_vec.begin(), transformed_vec.end(),
transformed_vec.begin(), [](float64_t e) {
if (std::abs(e - 0.0) <=
std::numeric_limits<float64_t>::epsilon())
Review comment (Member): Math::fequals does this.

Review comment (Member): plz don't change it to Math:: ;) we wanna kill that thing sooner or later.

Review comment (Member): hmm, I guess we need to replace this with a utility function somewhere then? Because this is quite a common operation.

Review comment (Member): could be an issue, and then while we are at it we can also replace all the CMath::fequals calls.

Review comment (Member): I think we should either introduce this utility function now, or just use CMath::fequals and include this in a larger refactor (it is copy-paste).
return -1.0;
else
return e;
});
return std::make_shared<BinaryLabels>(transformed_vec);
}
/** Transform labels back to original encoding.
Review comment (Member): base class.
*
* @param normalized encoding labels
* @return original encoding labels
*/
std::shared_ptr<Labels>
inverse_transform(const std::shared_ptr<Labels>& labs) override
{
auto normalized_labels = labs->as<BinaryLabels>();
normalized_labels->ensure_valid();
Review comment (Member): what does this do? The same as the check_is_valid below? I think it might require that at least two labels exist, so I am not sure that is appropriate here?

Reply (Contributor Author): yes, I should remove check_is_valid and only use ensure_valid here; ensure_valid has a stricter constraint, as it ensures the labels only contain {-1, +1}.

auto normalized_vector = normalized_labels->get_labels();
check_is_valid(normalized_vector);
std::transform(
normalized_vector.begin(), normalized_vector.end(),
normalized_vector.begin(), [](float64_t e) {
if (std::abs(e + 1.0) <=
Review comment (Member): Math::fequals or a utility func.
std::numeric_limits<float64_t>::epsilon())
return 0.0;
else
return e;
});
auto origin_vec = inverse_transform_impl(normalized_vector);
SGVector<int32_t> result_vec(origin_vec.vlen);
std::transform(
origin_vec.begin(), origin_vec.end(), result_vec.begin(),
[](auto&& e) { return static_cast<int32_t>(e); });
return std::make_shared<BinaryLabels>(result_vec);
}
/** Fit label encoder and return encoded labels.
Review comment (Member): base class.
*
* @param Target values.
* @return Labels transformed to be normalized.
*/
std::shared_ptr<Labels>
fit_transform(const std::shared_ptr<Labels>& labs) override
{
const auto result_vector = labs->as<DenseLabels>()->get_labels();
return std::make_shared<BinaryLabels>(
transform_impl(fit_impl(result_vector)));
}

virtual const char* get_name() const
{
return "BinaryLabelEncoder";
}

private:
void check_is_valid(const SGVector<float64_t>& vec)
{
const auto unique_set =
std::unordered_set<float64_t>(vec.begin(), vec.end());
require(
unique_set.size() == 2,
Review comment (Member): @gf712 I think it might not be good to assert this, as sometimes labels might only contain one class. But I guess this will pop up if it's a problem and we can change it then :)

Review comment (Member): hmm, in what situation would there only be one label?

Review comment (Member): When predicting, although I am not sure this is ever called in that case.

Review comment (Member): seems like this is only called in fit and transform, so it should be fine.

Review comment (Member): yep, +1, and if it becomes a problem we just change it.
"Binary labels should contain only two elements, ({}) have "
"been detected.",
fmt::join(unique_set, ", "));
Review comment (Member): neat! Didn't know this was a thing :D

Review comment (Member): nit: "Cannot interpret {} as binary labels"

Review comment (karlnapf, Jun 27, 2020): typo: "can not" -> "cannot"
}

bool is_convert_float_to_int(const SGVector<float64_t>& vec) const
{
SGVector<int32_t> converted(vec.vlen);
std::transform(
vec.begin(), vec.end(), converted.begin(),
[](auto&& e) { return static_cast<int32_t>(e); });
return std::equal(
vec.begin(), vec.end(), converted.begin(),
[](auto&& e1, auto&& e2) {
return std::abs(e1 - e2) >
std::numeric_limits<float64_t>::epsilon();
});
Review comment (Member): I think you need to rename this function; it is a bit confusing what you get back. Imo it should be can_convert_float_to_int, and then you just have to invert the logic to std::abs(e1 - e2) < std::numeric_limits<float64_t>::epsilon(), and then fix the logic in each call.

}
};
} // namespace shogun

#endif
106 changes: 106 additions & 0 deletions src/shogun/labels/LabelEncoder.h
@@ -0,0 +1,106 @@
/*
* This software is distributed under BSD 3-clause license (see LICENSE file).
*
* Authors: Yuhui Liu
*
*/

#ifndef _LABELENCODER__H__
#define _LABELENCODER__H__

#include <algorithm>
#include <map>
#include <memory>
#include <set>
#include <shogun/base/SGObject.h>
#include <shogun/lib/SGVector.h>
namespace shogun
{

class LabelEncoder : public SGObject
{
public:
LabelEncoder() = default;

virtual ~LabelEncoder() = default;

/** Fit label encoder
*
* @param Target values.
* @return SGVector which contains unique labels.
*/
virtual SGVector<float64_t>
fit(const std::shared_ptr<Labels>& labs) = 0;
/** Transform labels to normalized encoding.
*
* @param Target values to be transformed.
* @return Labels transformed to be normalized.
*/
virtual std::shared_ptr<Labels>
transform(const std::shared_ptr<Labels>& labs) = 0;
/** Transform labels back to original encoding.
*
* @param normalized encoding labels
* @return original encoding labels
*/
virtual std::shared_ptr<Labels>
inverse_transform(const std::shared_ptr<Labels>&) = 0;

/** Fit label encoder and return encoded labels.
*
* @param Target values.
* @return Labels transformed to be normalized.
*/
virtual std::shared_ptr<Labels>
fit_transform(const std::shared_ptr<Labels>&) = 0;

virtual const char* get_name() const
{
return "LabelEncoder";
}

protected:
SGVector<float64_t> fit_impl(const SGVector<float64_t>& origin_vector)
{
std::copy(
origin_vector.begin(), origin_vector.end(),
std::inserter(unique_labels, unique_labels.begin()));
return SGVector<float64_t>(
unique_labels.begin(), unique_labels.end());
}

SGVector<float64_t>
transform_impl(const SGVector<float64_t>& result_vector)
{
SGVector<float64_t> converted(result_vector.vlen);
std::transform(
result_vector.begin(), result_vector.end(), converted.begin(),
[& unique_labels = unique_labels,
&normalized_to_origin =
normalized_to_origin](const auto& old_label) {
auto new_label = std::distance(
unique_labels.begin(), unique_labels.find(old_label));
normalized_to_origin[new_label] = old_label;
return new_label;
});
return converted;
}

SGVector<float64_t>
inverse_transform_impl(const SGVector<float64_t>& result_vector)
{
SGVector<float64_t> original_vector(result_vector.vlen);
std::transform(
result_vector.begin(), result_vector.end(),
original_vector.begin(),
[& normalized_to_origin = normalized_to_origin](const auto& e) {
Review comment (gf712, Jun 15, 2020): btw, if you don't rename a variable you can just write it as [&normalized_to_origin].

Reply (Contributor Author): normalized_to_origin is a class member, so if I want to use normalized_to_origin in the lambda, I guess I have to write [&normalized_to_origin = normalized_to_origin]?

Review comment (gf712): ah yes, you're right!
return normalized_to_origin[e];
});
return original_vector;
}
std::set<float64_t> unique_labels;
std::unordered_map<float64_t, float64_t> normalized_to_origin;
Review comment (Member): suggested name: inverse_mapping

};
} // namespace shogun

#endif
81 changes: 81 additions & 0 deletions src/shogun/labels/MulticlassLabelsEncoder.h
@@ -0,0 +1,81 @@
/*
* This software is distributed under BSD 3-clause license (see LICENSE file).
*
* Authors: Yuhui Liu
*
*/
#ifndef _MulticlassLabelsEncoder__H__
#define _MulticlassLabelsEncoder__H__

#include <memory>
#include <shogun/base/SGObject.h>
#include <shogun/labels/DenseLabels.h>
#include <shogun/labels/LabelEncoder.h>
#include <shogun/labels/MulticlassLabels.h>
#include <shogun/lib/SGVector.h>

namespace shogun
{

class MulticlassLabelsEncoder : public LabelEncoder
Review comment (Member): same comments as for the binary labels class.

{
public:
MulticlassLabelsEncoder() = default;

~MulticlassLabelsEncoder() = default;

/** Fit label encoder
*
* @param Target values.
* @return SGVector which contains unique labels.
*/
SGVector<float64_t> fit(const std::shared_ptr<Labels>& labs) override
{
const auto result_vector = labs->as<DenseLabels>()->get_labels();
return fit_impl(result_vector);
}
/** Transform labels to normalized encoding.
*
* @param Target values to be transformed.
* @return Labels transformed to be normalized.
*/
std::shared_ptr<Labels>
transform(const std::shared_ptr<Labels>& labs) override
{
const auto result_vector = labs->as<DenseLabels>()->get_labels();
return std::make_shared<MulticlassLabels>(
transform_impl(result_vector));
}
/** Transform labels back to original encoding.
*
* @param normalized encoding labels
* @return original encoding labels
*/
std::shared_ptr<Labels>
inverse_transform(const std::shared_ptr<Labels>& labs) override
Review comment (Member): don't we also need a check valid here? Something that ensures that the labels are contiguous, i.e. [0, 1, 2, 3, 4, ...] with no gaps. I wrote some code for this in the links I posted; either re-use or remove my code :)

Reply (Contributor Author): I am not sure we need a check valid here, as inverse_transform maps from the internal encoding back to the original encoding. For example, {100, 100, 200, 300} -> {0, 0, 1, 2}: {0, 0, 1, 2} is the internal encoding, but the original labels are not contiguous.

Review comment (Member): ah, you are right of course :)
{
auto normalized_vector = labs->as<DenseLabels>()->get_labels();
return std::make_shared<MulticlassLabels>(
inverse_transform_impl(normalized_vector));
}
/** Fit label encoder and return encoded labels.
*
* @param Target values.
* @return Labels transformed to be normalized.
*/
std::shared_ptr<Labels>
fit_transform(const std::shared_ptr<Labels>& labs) override
{
const auto result_vector = labs->as<DenseLabels>()->get_labels();
return std::make_shared<MulticlassLabels>(
transform_impl(fit_impl(result_vector)));
}

virtual const char* get_name() const
{
return "MulticlassLabelsEncoder";
}
};
} // namespace shogun

#endif