-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TypedOneHotEncoder #322
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #322 +/- ##
==========================================
+ Coverage 94.83% 96.36% +1.53%
==========================================
Files 53 56 +3
Lines 968 991 +23
Branches 9 9
==========================================
+ Hits 918 955 +37
+ Misses 50 36 -14
Continue to review full report at Codecov.
|
@manuzhang thanks for the PR! Will be looking at this shortly (I hope!). |
} | ||
|
||
test("create() compiles only with correct inputs") { | ||
illTyped("TypedOneHotEncoder.create[Double]()") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you meant to use .apply[]
instead of .create[]
?
@manuzhang I had a quick look at the diff and it looks good to me, but this is an area for frameless I'm not very familiar with. @atamborrino Would you have time for a review? |
@manuzhang I am taking a closer look at the PR. I think this implementation can be made better and more type-safe. A nice API would be: val ds: TypedDataset[(String, Int, Long)] = ...
ds.transformOneHot(ds('_1)): TypedDataset[(String, Int, Long, Vector)] The above will not compile if you try to do one-hot encoding to any column that is not String or Char, so you can add further type-safety restriction on the type of column you want to transform. Also, note how the type of the resulting dataset correctly has a new column of type vector appended to the end. |
@imarios |
@manuzhang I totally dropped the ball on this PR. Let me go over this one more time and merge the PR. |
ml/src/main/scala/frameless/ml/feature/TypedOneHotEncoder.scala
Outdated
Show resolved
Hide resolved
ml/src/main/scala/frameless/ml/feature/TypedOneHotEncoder.scala
Outdated
Show resolved
Hide resolved
|
||
case class Outputs(output: Vector) | ||
|
||
sealed abstract class HandleInvalid(val sparkValue: String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sealed abstract class HandleInvalid(val sparkValue: String) | |
final class HandleInvalid private(val sparkValue: String) extends AnyVal |
This adds a typed API over Spark ML's
OneHotEncoderEstimator
since Spark 2.3.0