
[SPARK-50714][SQL] Enable schema evolution for TransformWithState when Avro encoding is used #49277

Status: Open — 18 commits to merge into `master`
Conversation

@ericm-db (Contributor) commented Dec 23, 2024

What changes were proposed in this pull request?

This PR introduces schema evolution support for stateful operators in Spark Structured Streaming by:

- Adding support for Avro-based schema evolution in state store encoders
- Introducing new classes and interfaces to manage schema metadata:
  - The `StateSchemaProvider` interface and its implementations
  - `StateSchemaBroadcast` for distributing schema information
  - `StateSchemaMetadata` to track schema versions
- Modifying state store providers and encoders to handle schema evolution:
  - Updated `RocksDBStateEncoder` to support reading data with evolved schemas
  - Added schema ID tracking in `StateStoreColFamilySchema`
  - Modified state store initialization to support schema providers
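The core mechanism behind this kind of evolution is Avro-style writer/reader schema resolution: data is decoded with the schema it was *written* with, then projected onto the *current* schema, with newly added fields filled from their declared defaults. A minimal, self-contained sketch of that idea (the names below are illustrative only, not Spark's or Avro's actual classes):

```scala
// Hypothetical, simplified sketch of writer/reader schema resolution.
// Not Spark's or Avro's actual API — purely to illustrate the concept.
object SchemaEvolutionSketch {
  // A "schema" here is just an ordered list of (fieldName, defaultValue).
  case class FieldSpec(name: String, default: Any)
  case class SimpleSchema(fields: List[FieldSpec])

  // Resolve a record written with an older schema against a newer reader
  // schema: fields the writer knew about keep their values, and fields
  // added later fall back to their declared defaults.
  def resolve(written: Map[String, Any], reader: SimpleSchema): Map[String, Any] =
    reader.fields.map { f =>
      f.name -> written.getOrElse(f.name, f.default)
    }.toMap

  def main(args: Array[String]): Unit = {
    val v1Record = Map[String, Any]("count" -> 1)
    val v2Schema = SimpleSchema(List(
      FieldSpec("count", 0),
      FieldSpec("lastSeen", null))) // field added in v2, with a default
    println(resolve(v1Record, v2Schema))
  }
}
```

The real implementation delegates this resolution to the Avro library, but the contract is the same: every stored row must be decodable with the exact schema version that produced it.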

Why are the changes needed?

Schema evolution is a critical feature for stateful stream processing applications that need to handle changing data schemas over time.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit and Integration tests in RocksDBStateStoreSuite and TransformWithStateSuite

Was this patch authored or co-authored using generative AI tooling?

```scala
    readerSchema: Schema,
    valueProj: UnsafeProjection): UnsafeRow = {
  if (valueBytes != null) {
    val reader = new GenericDatumReader[Any](writerSchema, readerSchema)
```

Review comment (Contributor): Let's add some comments here around the args.

@ericm-db ericm-db changed the title Ssm [SPARK-50714] Enable schema evolution for TransformWithState when Avro encoding is used Jan 3, 2025
@HyukjinKwon HyukjinKwon changed the title [SPARK-50714] Enable schema evolution for TransformWithState when Avro encoding is used [SPARK-50714][SQL] Enable schema evolution for TransformWithState when Avro encoding is used Jan 3, 2025

```scala
/**
 * Converts a Spark SQL schema to a corresponding Avro schema.
 * Handles nested types and adds support for schema evolution.
```

Review comment (Contributor): Could we add more details here? Also, maybe add comments for all the function args?


```scala
dataType match {
  // Basic types
  case BooleanType => false
```

Review comment (Contributor): Are these Avro defaults too?


```scala
// Complex types
case ArrayType(elementType, _) =>
  val defaultArray = new java.util.ArrayList[Any]()
```

Review comment (Contributor): Why not have empty collections? i.e. empty array/map etc.?
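The two threads above concern the default values generated for fields added by an evolved schema. A minimal, hypothetical sketch of such a per-type default mapping — the type names and choices below mirror the snippets under review but are not Spark's actual implementation:

```scala
import java.util.{ArrayList, HashMap}

// Hypothetical sketch of per-type defaults for newly added fields.
// Illustrative only; not Spark's actual code.
object DefaultsSketch {
  sealed trait SimpleType
  case object BooleanT extends SimpleType
  case object IntT extends SimpleType
  case class ArrayT(element: SimpleType) extends SimpleType
  case class MapT(value: SimpleType) extends SimpleType

  def defaultFor(t: SimpleType): Any = t match {
    case BooleanT   => false                          // scalar default
    case IntT       => 0
    case ArrayT(_)  => new ArrayList[Any]()           // empty collection,
    case MapT(_)    => new HashMap[String, Any]()     // as the review suggests
  }
}
```

Whatever defaults are chosen here must match what the generated Avro schema declares, since Avro fills missing fields from the schema's own default clause during resolution.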

```scala
catalystType: DataType,
nullable: Boolean = false,
recordName: String = "topLevelRecord",
nameSpace: String = "",
```

Review comment (Contributor): Just say `namespace`? Also, what does this refer to?

```diff
@@ -64,7 +64,9 @@ class IncrementalExecution(
     val watermarkPropagator: WatermarkPropagator,
     val isFirstBatch: Boolean,
     val currentStateStoreCkptId:
-      MutableMap[Long, Array[Array[String]]] = MutableMap[Long, Array[Array[String]]]())
+      MutableMap[Long, Array[Array[String]]] = MutableMap[Long, Array[Array[String]]](),
+    val stateSchemaMetadatas: MutableMap[Long, StateSchemaBroadcast] =
```

Review comment (Contributor): Let's add some comments for this?

```scala
val stateSchemaMetadata = StateSchemaMetadata.
  createStateSchemaMetadata(checkpointLocation, hadoopConf, stateSchemaList.head)

val stateSchemaBroadcast =
```

Review comment (Contributor): Let's note that the broadcast happens here for the first run.
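The pattern discussed here — build the schema metadata once on the driver, broadcast it, and let executors resolve a row's writer schema by the schema ID stored with the row — can be sketched as follows. The names and the `Short` schema-ID type are illustrative assumptions, not the actual `StateSchemaBroadcast` API:

```scala
// Hypothetical sketch of the broadcast-lookup pattern: the driver builds an
// id -> schema map once, ships it to executors, and each executor resolves
// the writer schema recorded alongside a row by its id.
// Names are illustrative; not Spark's actual StateSchemaBroadcast.
object SchemaBroadcastSketch {
  case class SchemaMetadata(schemasById: Map[Short, String]) {
    def writerSchemaFor(id: Short): String =
      schemasById.getOrElse(id, sys.error(s"unknown schema id $id"))
  }

  def main(args: Array[String]): Unit = {
    // Built on the driver during the first run, then broadcast to executors.
    val meta = SchemaMetadata(Map(
      0.toShort -> "v1-schema-json",
      1.toShort -> "v2-schema-json"))
    // On an executor: a row tagged with schema id 0 is decoded with the v1
    // writer schema and projected onto the current (v2) reader schema.
    println(meta.writerSchemaFor(0.toShort))
  }
}
```

Broadcasting once per query run keeps the per-row cost down to a map lookup, which matters on the hot decode path.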

```diff
@@ -149,7 +149,8 @@ case class TransformWithStateInPandasExec(
     initialStateGroupingAttrs.map(SortOrder(_, Ascending)))

   override def operatorStateMetadata(
-      stateSchemaPaths: List[String]): OperatorStateMetadata = {
+      stateSchemaPaths: List[List[String]]
+  ): OperatorStateMetadata = {
```

Review comment (Contributor): nit: can this fit on the line above?

```diff
@@ -21,6 +21,10 @@ import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProj
 import org.apache.spark.sql.execution.streaming.state.{NoPrefixKeyStateEncoderSpec, StateStore}
 import org.apache.spark.sql.types._

+object ListStateMetricsImpl {
```

Review comment (Contributor): Let's combine this with some existing utils object?

```diff
@@ -135,6 +136,8 @@ class MicroBatchExecution(
   // operatorID -> (partitionID -> array of uniqueID)
   private val currentStateStoreCkptId = MutableMap[Long, Array[Array[String]]]()

+  private val stateSchemaMetadatas = MutableMap[Long, StateSchemaBroadcast]()
```

Review comment (Contributor): nit: let's add a comment here explaining what this is for.

```diff
-    hasTtl: Boolean): StateStoreColFamilySchema = {
-  StateStoreColFamilySchema(
+    hasTtl: Boolean): Map[String, StateStoreColFamilySchema] = {
+  val schemas = mutable.Map[String, StateStoreColFamilySchema]()
```

Review comment (Contributor): Why do we need these maps?

```diff
@@ -344,7 +347,7 @@ class StatefulProcessorHandleImpl(
  * the StatefulProcessor is initialized.
  */
 class DriverStatefulProcessorHandleImpl(timeMode: TimeMode, keyExprEnc: ExpressionEncoder[Any])
-  extends StatefulProcessorHandleImplBase(timeMode, keyExprEnc) {
+  extends StatefulProcessorHandleImplBase(timeMode, keyExprEnc) with Logging {
```

Review comment (Contributor): Intentional?

```diff
@@ -43,6 +43,15 @@ object TimerStateUtils {
       TimerStateUtils.PROC_TIMERS_STATE_NAME + TimerStateUtils.KEY_TO_TIMESTAMP_CF
     }
   }
+
+  def getTimerStateSecIndexName(timeMode: String): String = {
```

Review comment (Contributor): Could we combine this with the function above and just make a generic function?
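The generic helper the reviewer suggests would replace two near-duplicate name builders with one function parameterized by the index kind. A hypothetical sketch — the constant values below are made up for illustration and are not Spark's actual `TimerStateUtils` constants:

```scala
// Hypothetical sketch of the single generic helper the reviewer suggests,
// instead of separate primary/secondary-index name builders.
// Constant values are illustrative, not Spark's actual TimerStateUtils.
object TimerNameSketch {
  val ProcTimersPrefix = "procTimers-"
  val EventTimersPrefix = "eventTimers-"
  val KeyToTimestampSuffix = "keyToTimestamp"
  val TimestampToKeySuffix = "timestampToKey"

  // One function covers both column families: the primary (key -> timestamp)
  // and the secondary index (timestamp -> key).
  def timerCFName(timeMode: String, secondaryIndex: Boolean): String = {
    val prefix =
      if (timeMode == "ProcessingTime") ProcTimersPrefix else EventTimersPrefix
    val suffix =
      if (secondaryIndex) TimestampToKeySuffix else KeyToTimestampSuffix
    prefix + suffix
  }
}
```

Collapsing the two builders this way also keeps the naming scheme for the two column families in one place, so they cannot drift apart.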

```scala
val columnFamilySchemas = getDriverProcessorHandle().getColumnFamilySchemas ++
  Map(
    StateStore.DEFAULT_COL_FAMILY_NAME ->
      StateStoreColFamilySchema(StateStore.DEFAULT_COL_FAMILY_NAME,
```

Review comment (Contributor): The indent seems off?

```diff
 }

 /** Metadata of this stateful operator and its states stores. */
 override def operatorStateMetadata(
-    stateSchemaPaths: List[String]): OperatorStateMetadata = {
+    stateSchemaPaths: List[List[String]]
+): OperatorStateMetadata = {
```

Review comment (Contributor): nit: could this go on the line above?
