
#787 - Move encoder implementation details to external shim library (not dependent on the Spark 4 release) #800

Open
wants to merge 74 commits into master

Conversation

chris-twiner
Contributor

Per #787 and #300: the key files and code that have changed since the 2.4->3 migration are abstracted out and moved to arity-based shims and helper functions (this includes Spark 4 snapshot changes; see below). If the approach and scope of shim usage within Frameless is OK, I'll push out an actual 0.0.1 cut (excluding Spark 4); in the meantime the snapshots are on central.

Frameless' use of the internal APIs for extending or mixing in is thus far (aside from the lit pushdown issue) isolated to the shim interfaces for encoding usage (and creating AnalysisException). As such, a version compiled from the shimmed 0.16 against 3.5.0 allows encoding to be used on 14.3 LTS (3.5 with the 4.0 StaticInvoke) down to 9.1 LTS (3.1.3) without issue.
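As an illustration of that isolation (a hypothetical sketch only, not the actual sparkutils/shim API), the idea is that Frameless codes against a small, stable interface and the runtime jar supplies the Spark-version-specific implementation:

trait AnalysisExceptionShim {
  // Spark 3.x and Spark 4 construct AnalysisException with different
  // arguments; the runtime jar hides that difference behind this method.
  def analysisException(message: String): Throwable
}

object FramelessSide {
  // Frameless-side usage: no direct dependency on the AnalysisException
  // constructor, so the Frameless build does not need to change.
  def raise(shim: AnalysisExceptionShim, msg: String): Nothing =
    throw shim.analysisException(msg)
}

Swapping the shim runtime jar then swaps the implementation without recompiling Frameless.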

Of note: I've run the Quality tests against the Spark 4 SNAPSHOT proper (with the Frameless 3.5 build), and in a local Spark 4 build of Frameless all tests pass, although the cats tests sometimes freeze when run directly (on "inner pairwise monoid"). I've also run the Quality tests against Frameless built against 4.

Lit and UDF on the base Expression are (as of 29th Feb) stable API-wise against Spark 4, so there is no need for shims there.

NB: I've only pushed a Spark 4 shim_runtime against 2.13 into snapshots; I won't push any full versions until RCs drop.

If the approach / change scope is OK, I'll push release versions of shim out and remove the resolvers:

// needed for shim_runtime snapshots
resolvers in Global += MavenRepository(
  "sonatype-s01-snapshots",
  Resolver.SonatypeS01RepositoryRoot + "/snapshots"
)
// needed for 4.0 snapshots
resolvers in Global += MavenRepository(
  "apache_snaps",
  "https://repository.apache.org/content/repositories/snapshots"
)

Per #787: let me know if I should shim anything else (I don't see Dataset/SparkSession etc. as being useful, but I'd be happy to move the Scala reflection stuff over to shim, not that it's currently needed).

NB (Spark 4 changes):

The build sbt needs the correct shim_runtime_4.0.0.oss_4.0 dependency and 2.13 as the main Scala version (as well as JDK 17/21). Comments have the versions.
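For illustration, the dependency line could look roughly like the following (the organisation and version shown are assumptions, not confirmed coordinates; the comments in build.sbt have the real values):

// illustrative only: organisation and version are placeholders
libraryDependencies += "com.sparkutils" %% "shim_runtime_4.0.0.oss_4.0" % "0.0.1-SNAPSHOT"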

Source (changes compatible with 0.16 builds):

  1. Swapped FramelessInternals to use a shim to create an AnalysisException (the constructor takes different args on Spark 4).
  2. TypedColumn gets actual imports from org.apache.spark.sql.catalyst.expressions (there is a new With expression in Spark 4).
  3. The pushdown tests need to use the previous currentTimestamp code (Spark 4 removed it; this could be shimmed if preferred).
  4. SchemaTests.structToNonNullable resets the metadata; Spark 4 sets metadata, so the properties don't hold and you'll get (see the sketch below):
Expected StructType(StructField(_1,LongType,false),StructField(_2,LongType,false)) but got StructType(StructField(_1,LongType,false),StructField(_2,LongType,false))
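As an illustration of point 4 (a sketch only, not the exact test change), resetting the metadata so the structural equality check holds could look like:

import org.apache.spark.sql.types.{ Metadata, StructType }

object SchemaTestHelpers { // hypothetical helper name
  // Sketch: copy each field with empty metadata so the metadata Spark 4
  // attaches no longer breaks equality with the expected schema.
  def resetMetadata(st: StructType): StructType =
    StructType(st.fields.map(_.copy(metadata = Metadata.empty)))
}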

In order to run tests on JDK 17/21 you'll need to add these VM args:

--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED
--add-opens=java.base/sun.security.action=ALL-UNNAMED
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED
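
For example (a sketch, assuming the tests are forked in sbt), they can be supplied as:

// pass the add-opens flags to the forked test JVM
Test / fork := true
Test / javaOptions ++= Seq(
  "--add-opens=java.base/java.lang=ALL-UNNAMED",
  "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
  // ... plus the remaining --add-opens entries listed above
)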



codecov bot commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 96.81529% with 10 lines in your changes missing coverage. Please review.

Project coverage is 95.60%. Comparing base (0fb9c58) to head (986891a).
Report is 1 commit behind head on master.

❗ Current head 986891a differs from pull request most recent head 25cc5c3. Consider uploading reports for the commit 25cc5c3 to get more accurate results

Files Patch % Lines
...taset/src/main/scala/frameless/RecordEncoder.scala 92.85% 5 Missing ⚠️
...taset/src/main/scala/frameless/functions/Udf.scala 93.10% 4 Missing ⚠️
.../src/main/scala/frameless/FramelessInternals.scala 96.42% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
+ Coverage   95.46%   95.60%   +0.13%     
==========================================
  Files          67       65       -2     
  Lines        1257     1341      +84     
  Branches       42       52      +10     
==========================================
+ Hits         1200     1282      +82     
- Misses         57       59       +2     
Flag Coverage Δ
2.12-root-spark33 95.30% <95.54%> (-0.16%) ⬇️
2.12-root-spark34 ?
2.12-root-spark35 95.52% <96.49%> (+0.06%) ⬆️
2.13-root-spark35 96.08% <97.10%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@cchantep
Collaborator

cchantep commented Mar 2, 2024

Is there an issue with the existing build?
Otherwise I'm not sure we want to introduce such an upstream dependency.

@chris-twiner
Contributor Author

Is there an issue with the existing build? Otherwise I'm not sure we want to introduce such an upstream dependency.

As described in #787, the 14.2 and 14.3 LTS Databricks Runtimes cannot use Frameless 0.16 due to backported Spark 4 changes. The core of the encoding derivation logic, at least, has been essentially identical since the 2.4 days / 2020, when I started making commits to support newer versions/runtimes; what changes is internal Spark API usage.

This PR aims for a hot-swappable, jar-based solution to such changes and would reduce the dependency on the core of Frameless (including committers) to support such runtime differences, i.e. focus on OSS major releases only* and have the runtime compatibility issues pushed to another library.

* This is not strictly required for encoding either; per the above, a 0.16+shim Frameless base can encode on all versions from 3.1.3 through to the 4.0 nightlies/14.3 DBR by swapping the runtime.

@pomadchin
Member

Yes, it feels like we definitely need some kind of shims to simplify life for the DB folks who use Frameless.

@chris-twiner
Contributor Author

Per b880261, the proper fixes for #803 and #804 are confirmed as working on all LTS versions of Databricks, Spark 4, and the latest 15.0 runtime; the test combinations are documented here.

@chris-twiner
Contributor Author

NB: the 25cc5c3 test cases are able to run on a full 14.3 and 15.0 cluster; most failures are now due to randomness rather than actual test code issues.

@pomadchin
Member

pomadchin commented Jun 18, 2024

omg that's huge; I'll try to get back to you soonish!

@pomadchin
Member

RE: Spark4: If Spark 4 is such a breaking change we can just release frameless-spark4 artifacts; i.e. that's how tapir maintains play2.9 libs

RE: DBR compat layer: This PR is hard to review 🤔 we need to figure something out with fmt.

Member

@pomadchin pomadchin left a comment


I started doing some reviews, fmt is not our best friend, but the commit history is! 🎉

Comment on lines +40 to +43
implicit class seqToRdd[A: ClassTag](
seq: Seq[A]
)(implicit
sc: SC) {
Member

I wonder, is that how scalafmt formats it?

Contributor Author

The builds won't run unless scalafmt is run on the files, so yes, those settings lead to some hideous-looking code.

Comment on lines +138 to +139
val expectedSum =
if (seq.isEmpty) None else Some(Foldable[List].fold(seq))
Member

fmt is super weird

Comment on lines -1 to -7
package frameless

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, MapGroups => SMapGroups}

object MapGroups {
Member

Q. What makes it impossible to keep these MapGroups here, and use the override from the shim in the client code?

The client code will pick up the version from shim, and it will enforce the MapGroups implementation from the dependency.

We can play around with it, that's for sure, but I am curious how far we can push it to the user side, not enforcing behaviour / shim usage by the lib implementation.

Member

^ Also, given how convoluted https://github.com/sparkutils/shim looks, I'd really like to find a solution where the user replaces / uses their own implementation of the critical classes (i.e. by pulling shims into deps).

Contributor Author

^ Also, given how convoluted https://github.com/sparkutils/shim looks, I'd really like to find a solution where the user replaces / uses their own implementation of the critical classes (i.e. by pulling shims into deps).

That's the point of the shim lib. When a user wants an upgrade (and the shim is there) they use the newer shim; no need to change the Frameless version used. The complexity is there, it's just about how to manage it; I'd personally prefer it to be in a single lib rather than in client code. Currently only the testing is version specific.

You could, per the other point, move the call to the impl out into a typeclass; this wouldn't impact the user's source all that much if a default impl delegating to shim existed (a recompile would be needed of course for the extra implicit). That could give the best of both worlds: point-location fixes and a bunch of default fixes.
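
A rough sketch of what that could look like (all names here are hypothetical, not the existing Frameless or shim API):

import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical typeclass: Frameless would call this at each point location,
// the default instance delegates to the shim, and users can supply their own.
trait MapGroupsBuilder {
  def mapGroups[K, T, U](
      func: (K, Iterator[T]) => Iterator[U],
      groupingAttributes: Seq[Attribute],
      dataAttributes: Seq[Attribute],
      child: LogicalPlan
    ): LogicalPlan
}

object MapGroupsBuilder {
  // Illustrative placeholder: a real default would construct the plan via
  // the shim's MapGroups for the Spark version on the classpath.
  implicit def default: MapGroupsBuilder =
    sys.error("placeholder - the real default delegates to shim")
}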

@chris-twiner
Contributor Author

RE: Spark4: If Spark 4 is such a breaking change we can just release frameless-spark4 artifacts; i.e. that's how tapir maintains play2.9 libs

This just pushes the problem out again and keeps Frameless focussed on Spark differences. Along comes 4.1.0 and breaks more, or 5, or ...

RE: DBR compat layer: This PR is hard to review 🤔 we need to figure something out with fmt.

Yeah, there should have been a reformat-all-files PR immediately after adding it in; ah well, hindsight etc.

@pomadchin
Member

@chris-twiner

This just pushes the problem out again and keeps Frameless focussed on Spark differences. Along comes 4.1.0 and breaks more, or 5, or ...

This library is dependent on Spark, so I'd say we do the breaking change and move Frameless to Spark 4, and keep maintenance releases if users ask for them.

Maintaining cross-major-version releases via a shims lib could be a bit too much.

BTW, the same concerns apply to the DB runtime:

  • how many versions are we gonna support?
  • who supports them?
  • can users support their own DB version runtime that is not yet published?

Maybe we could have some abstraction that users could override themselves, so we shift this responsibility to the user?

@chris-twiner
Contributor Author

This library is dependent on Spark, so I'd say we do the breaking change and move Frameless to Spark 4, and keep maintenance releases if users ask for them.

Maintaining cross-major-version releases via a shims lib could be a bit too much.

BTW, the same concerns apply to the DB runtime:

  • how many versions are we gonna support?
  • who supports them?
  • can users support their own DB version runtime that is not yet published?

Maybe we could have some abstraction that users could override themselves, so we shift this responsibility to the user?

I'll answer the last first: if it's typeclass based they can throw their own in, and each location gets it (not sure this works for all of them off the top of my head, but I could try it out).
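
Continuing the hypothetical sketch from the earlier comment (same illustrative imports and names), a user on an unpublished runtime could bring their own instance into scope and it would be picked up at each location:

object MyRuntimeShims {
  // Hypothetical: the user's instance wins over the default when in scope.
  implicit val myRuntimeMapGroups: MapGroupsBuilder = new MapGroupsBuilder {
    def mapGroups[K, T, U](
        func: (K, Iterator[T]) => Iterator[U],
        groupingAttributes: Seq[Attribute],
        dataAttributes: Seq[Attribute],
        child: LogicalPlan
      ): LogicalPlan =
      sys.error("build the plan against that runtime's internals here")
  }
}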

Version-wise, Frameless should only support its usage interface; if someone (e.g. me) requires a funny version of Frameless to work with an obscurity of Databricks, then let them provide the shim PR (or, per the above, a custom typeclass) and externalise the problem. The only things I'm aware of where this approach isn't easy are changes which stop UDFs or lits working (the final gencode springs to mind, as does the foldable pushdown issue) and of course the whole agnostic encoder mess.

If you think the idea of using typeclasses to isolate this sounds reasonable, I'm happy to give it a go.
