#787 - Move encoder implementation details to external shim library (not dependent on the Spark 4 release) #800
base: master
Conversation
…ifisnull, GetColumnByOrdinal, MapObjects and TypedExpressionEncoder shimmed
…ifisnull, GetColumnByOrdinal, MapObjects and TypedExpressionEncoder shimmed - attempt build
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #800 +/- ##
==========================================
+ Coverage 95.46% 95.60% +0.13%
==========================================
Files 67 65 -2
Lines 1257 1341 +84
Branches 42 52 +10
==========================================
+ Hits 1200 1282 +82
- Misses 57 59 +2
Flags with carried forward coverage won't be shown.
Is there an issue with the existing build?
As described in #787, the 14.2 and 14.3 LTS Databricks Runtimes cannot use Frameless 0.16 because they backport Spark 4 changes. The core of the encoding derivation logic, at least, has been essentially identical since the 2.4 days (circa 2020, when I started making commits to support newer versions/runtimes); what changes is internal Spark API usage. This PR aims for a hot-swappable, jar-based solution to such changes, which would reduce the dependency on the core of frameless (including committers) to support such runtime differences, i.e. frameless focuses on OSS major releases only* and the runtime compatibility issues are pushed to another library.

* This is not strictly required for encoding either; per the above, a 0.16+shim Frameless base can encode on all versions from 3.1.3 through to the 4.0 nightlies/14.3 DBR by swapping the runtime jar.
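A sketch of what that hot swap could look like from the build side; the coordinates and version numbers here are assumptions for illustration (following the shim_runtime naming pattern mentioned later in this thread), not released artifacts:

// build.sbt fragment (sketch): frameless stays fixed, only the shim runtime
// artifact is swapped to match the cluster. Coordinates are hypothetical.
libraryDependencies ++= Seq(
  "org.typelevel" %% "frameless-dataset" % "0.16.0",
  // pick the shim runtime matching the cluster, e.g. an OSS Spark 3.5 build;
  // moving to another runtime means replacing only this artifact
  "com.sparkutils" %% "shim_runtime_3.5.0.oss_3.5" % "0.0.1-SNAPSHOT"
)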
Yes, feels like we definitely need some kind of shims to simplify the life of the DB folks who use frameless.
…teStruct, and allow shims for the deprecated functions
per b880261, proper fixes for #803 and #804 are confirmed as working on all LTS versions of Databricks, Spark 4 and the latest 15.0 runtime - the test combinations are documented here
…clusters - infinity protection
…0 databricks doesn't process them on ordered dataset
NB: the 25cc5c3 test cases are able to run on a full 14.3 and 15.0 cluster; most fails are now due to randomness rather than actual test code issues.
omg that's huge; I'll try to get back to you soonish!
RE: Spark 4: if Spark 4 is such a breaking change we can just release frameless-spark4 artifacts; i.e. that's how tapir maintains play2.9 libs.
RE: DBR compat layer: this PR is hard to review 🤔 we need to figure something out with fmt.
I started doing some reviews, fmt is not our best friend, but the commit history is! 🎉
implicit class seqToRdd[A: ClassTag](
    seq: Seq[A]
  )(implicit
    sc: SC) {
I wonder, is this how scalafmt formats it?
the builds won't run unless scalafmt is run on the files, so yes, those settings lead to some hideous-looking code.
val expectedSum =
  if (seq.isEmpty) None else Some(Foldable[List].fold(seq))
fmt is super weird
package frameless

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, MapGroups => SMapGroups}

object MapGroups {
Q. What makes it impossible to keep these MapGroups here, and use the override from the shim in the client code?
The client code would pick up the version from the shim, and it would enforce the MapGroups implementation from the dependency.
We can play around with it, that's for sure, but I'm curious how far we can push this to the user side, without the lib implementation enforcing behavior / shim usage.
^ Also given how convoluted https://github.com/sparkutils/shim looks, I'd really like to find a solution where the user replaces / uses their own implementation of the critical classes (i.e. by pulling shims into deps)
> ^ Also given how convoluted https://github.com/sparkutils/shim looks, I'd really like to find a solution where the user replaces / uses their own implementation of the critical classes (i.e. by pulling shims into deps)
That's the point of the shim lib: when a user wants an upgrade (and the shim is there) they use the newer shim, with no need to change the frameless version used. The complexity is there regardless; it's just about where to manage it, and I'd personally prefer it sits in a single lib rather than in client code. Currently only the testing is version specific.
You could, per the other point, move the call to impl out into a typeclass; this wouldn't impact the user's source all that much if a default impl delegating to shim existed (a recompile would be needed, of course, for the extra implicit). That could give the best of both worlds: point-location fixes and a bunch of default fixes. A sketch of this follows below.
This just pushes the problem out again and keeps frameless focussed on Spark differences. Along comes 4.1.0 and breaks more, or 5, or ...
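A minimal sketch of that typeclass idea, reusing the imports from the MapGroups snippet above; the MapGroupsImpl name and the exact SMapGroups arity are assumptions here, not the PR's actual code:

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, MapGroups => SMapGroups}

// Hypothetical capability frameless call sites would summon instead of
// calling the shim's MapGroups directly.
trait MapGroupsImpl {
  def apply[K: Encoder, T: Encoder, U: Encoder](
      func: (K, Iterator[T]) => TraversableOnce[U],
      groupingAttributes: Seq[Attribute],
      dataAttributes: Seq[Attribute],
      child: LogicalPlan
    ): LogicalPlan
}

object MapGroupsImpl {
  // Default instance: delegate to the runtime-specific node, as the existing
  // MapGroups object does; users could shadow this implicit with their own.
  implicit val default: MapGroupsImpl = new MapGroupsImpl {
    def apply[K: Encoder, T: Encoder, U: Encoder](
        func: (K, Iterator[T]) => TraversableOnce[U],
        groupingAttributes: Seq[Attribute],
        dataAttributes: Seq[Attribute],
        child: LogicalPlan
      ): LogicalPlan =
      SMapGroups(func, groupingAttributes, dataAttributes, child)
  }
}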
yeah, there should have been a reformat-all-files PR immediately after adding it in; ah well, hindsight etc.
So this library is dependent on Spark, so I'd say we do the breaking change and move frameless to Spark 4, and keep maintenance releases if users ask for them. Maintaining cross-major-version releases via a shims lib could be a bit too much. BTW, the same concerns apply to the DB runtime:
Maybe we could have some abstraction that users could override themselves, so we shift this responsibility to the user?
I'll answer the last first - if it's typeclass based they can throw their own in; each location gets it (not sure this works for all of them off the top of my head, but I could try it out).
Version wise - frameless should only support its usage interface; if someone - e.g. me - requires a funny version of frameless to work with an obscurity of Databricks, then let them provide the shim PR (or, per above, a custom typeclass) - externalise the problem. The only things I'm aware of which aren't easy to take this approach with are changes which stop udfs or lits working (the final gencode springs to mind, as does the foldable pushdown issue) and, of course, the whole agnostic encoder mess.
If you think the idea of using typeclasses to isolate this sounds reasonable, I'm happy to give it a go.
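On the "throw their own in" point, user-side usage of the hypothetical MapGroupsImpl typeclass sketched above could be as simple as shadowing the default instance:

// Hypothetical user-side override: a higher-priority implicit in scope means
// every frameless call site that summons MapGroupsImpl picks this up instead.
implicit val myRuntimeMapGroups: MapGroupsImpl = new MapGroupsImpl {
  def apply[K: Encoder, T: Encoder, U: Encoder](
      func: (K, Iterator[T]) => TraversableOnce[U],
      groupingAttributes: Seq[Attribute],
      dataAttributes: Seq[Attribute],
      child: LogicalPlan
    ): LogicalPlan =
    // build whatever logical plan node this particular runtime expects
    SMapGroups(func, groupingAttributes, dataAttributes, child)
}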
per #787 and #300 - the key files and code that have changed since the 2.4->3 migration are abstracted out and moved to arity-based shims and helper functions (this includes Spark 4 snapshot changes - see below). If the approach and scope of Shim usage within Frameless is ok, I'll push out an actual 0.0.1 cut (excluding Spark 4); in the meantime the snaps are on central.
Frameless' use of the internal APIs for extending or mixing in is thus far (aside from the lit pushdown issue) isolated to the shim interfaces for encoding usage (and creating AnalysisException). As such, a shimmed 0.16 compiled against 3.5.0 allows encoding to be used on 14.3 LTS (3.5 with 4.0 StaticInvoke) down to 9.1 LTS (3.1.3) without issue.
Of note - I've run Quality tests against Spark 4 SNAPSHOT proper (with the Frameless 3.5 build), and in a local build of Spark 4 Frameless all tests pass, although cats tests sometimes freeze when run directly (on "inner pairwise monoid"). (I've also run the Quality tests against Frameless built against 4.)
Lit and UDF on base Expression are (as of 29th Feb) stable API-wise against Spark 4, so there is no need for shims there.
NB: I've only pushed a Spark 4 shim_runtime against 2.13 into snapshots; I won't push any full versions until RCs drop.
If the approach / change scope is ok I'll push release versions of shim out and remove the resolvers:
per #787 - let me know if I should shim something else (I don't see Dataset/SparkSession etc. as being useful, but I'd be happy to move the scala reflection stuff over to shim, not that it's currently needed)
NB (Spark 4 changes:
build.sbt needs the correct shim_runtime_4.0.0.oss_4.0 dependency and 2.13 as the main Scala version, as well as JDK 17/21; the comments have the versions (see the build sketch below).
Source (changes compatible with 0.16 builds):
In order to run tests on JDK 17/21 you'll need to add this to the VM args:
)
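A sketch of the build.sbt wiring referenced above; the version literals are placeholders (the comments in the actual build.sbt have the real values):

// build.sbt fragment (sketch): Spark 4 shim runtime and 2.13 as the main
// Scala version. Version strings here are assumptions, not confirmed values.
ThisBuild / scalaVersion := "2.13.13"

libraryDependencies +=
  "com.sparkutils" %% "shim_runtime_4.0.0.oss_4.0" % "0.0.1-SNAPSHOT"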