
[ISSUE] publish_table() intermittently fails silently after source table recreation - possible race condition (large tables) #1134

@msetsma

Description


publish_table() silently fails when recreating an online table after:

  1. Deleting the online table via delete_online_table()
  2. Dropping and recreating the source offline feature table
  3. Calling publish_table()

The method returns a PublishedTable object and raises no exception, but the online table is never created. The failure only surfaces when the online table is queried.

Reproduction

from databricks.feature_engineering import FeatureEngineeringClient
from databricks.sdk import WorkspaceClient

fe_client = FeatureEngineeringClient()
ws_client = WorkspaceClient()

ONLINE_TABLE = "catalog.schema.my_online_table"
OFFLINE_TABLE = "catalog.schema.my_feature_table"
ONLINE_STORE = "my-online-store"

# Step 1: Delete online table
fe_client.delete_online_table(name=ONLINE_TABLE)

# Step 2: Drop offline feature table
fe_client.drop_table(name=OFFLINE_TABLE)

# Step 3: Recreate offline feature table
fe_client.create_table(
    name=OFFLINE_TABLE,
    primary_keys=["id"],
    df=spark.table("source_data"),
    description="Recreated table",
)

# Step 4: Enable CDF
spark.sql(f"ALTER TABLE {OFFLINE_TABLE} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Step 5: Republish — returns success, no exception
online_store = fe_client.get_online_store(name=ONLINE_STORE)
result = fe_client.publish_table(
    online_store=online_store,
    source_table_name=OFFLINE_TABLE,
    online_table_name=ONLINE_TABLE,
)

# Step 6: Query fails
spark.table(ONLINE_TABLE).count()
# AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view does not exist
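
Because the failure only surfaces when the online table is queried, the only caller-side confirmation today is to check for the table after step 5. A minimal sketch, assuming the online table becomes visible to spark.catalog.tableExists() once creation completes (the helper name is ours, not part of the SDK):

import time

def wait_for_online_table(name, timeout_s=300, poll_s=15):
    """Poll until the table is visible in the catalog or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if spark.catalog.tableExists(name):
            return True
        time.sleep(poll_s)
    return False

if not wait_for_online_table(ONLINE_TABLE):
    raise RuntimeError(
        f"publish_table() returned {result} but {ONLINE_TABLE} was never created"
    )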

Expected Behavior

One of:

  1. Preferred: publish_table() creates the online table from the recreated source table.

  2. Alternative: publish_table() raises an explicit error (a caller-side handling sketch follows this list):

    OnlineStoreStateError: Online store 'my-online-store' has stale references 
    to dropped table 'catalog.schema.my_feature_table'. Drop and recreate the 
    online store, or use force_recreate=True.
    
  3. Minimum: Return a status object indicating failure instead of a success response.
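
If option 2 above were adopted, callers could branch on the failure instead of discovering it later. A hypothetical sketch; OnlineStoreStateError and force_recreate are the names proposed in item 2, not part of the current SDK:

try:
    result = fe_client.publish_table(
        online_store=online_store,
        source_table_name=OFFLINE_TABLE,
        online_table_name=ONLINE_TABLE,
    )
except OnlineStoreStateError:
    # Proposed behavior: retry with the (hypothetical) force_recreate flag,
    # or fall back to dropping and recreating the online store.
    result = fe_client.publish_table(
        online_store=online_store,
        source_table_name=OFFLINE_TABLE,
        online_table_name=ONLINE_TABLE,
        force_recreate=True,
    )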

Workaround

Drop and recreate the entire online store before republishing:

try:
    fe_client.delete_online_store(name=ONLINE_STORE)
except Exception:
    pass

fe_client.create_online_store(name=ONLINE_STORE, capacity="CU_1")

This workaround is unreliable; switching to the REST API gave better results than the SDK.

The workaround is also destructive for production systems where the online store serves multiple tables, since dropping the store removes every published table, not just the one being republished.
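
For reference, the full sequence used when applying the workaround is roughly the following; the capacity value matches the snippet above, and the final check is only a sanity guard on our side, not an SDK feature:

# Destructive: removes every table served by ONLINE_STORE, not just the one being republished.
try:
    fe_client.delete_online_store(name=ONLINE_STORE)
except Exception:
    pass  # store may already be gone

fe_client.create_online_store(name=ONLINE_STORE, capacity="CU_1")
# May need to wait here until the new store reports State.AVAILABLE before publishing.
online_store = fe_client.get_online_store(name=ONLINE_STORE)

fe_client.publish_table(
    online_store=online_store,
    source_table_name=OFFLINE_TABLE,
    online_table_name=ONLINE_TABLE,
)

# Confirm the online table actually exists before relying on it.
assert spark.catalog.tableExists(ONLINE_TABLE), f"{ONLINE_TABLE} still missing after republish"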

Is it a regression?

Unknown.

Environment

  • Databricks Runtime: 16.4 LTS ML
  • databricks-feature-engineering: tested with >=0.13.0a3 and >=0.13.0
  • Compute: Serverless

Debug Logs

Available on request.

Additional Context

  • The returned PublishedTable object is identical to the one from a successful publish
  • The online store reports State.AVAILABLE before and after the failed publish
  • get_online_store() reports no issues
  • Other scenarios (e.g., publishing to an existing table) emit warnings, so this silent failure is inconsistent with existing error handling
  • Drop → recreate → republish is a standard workflow during feature development

Suspected Root Cause

The online store appears to maintain internal references to the source table's metadata. Those references become invalid when the table is dropped and recreated, and publish_table() does not validate them before returning success.
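
The stale-reference hypothesis can be illustrated by comparing the Delta table id across the drop/recreate. A minimal sketch, assuming the offline feature table is a Delta table:

# DESCRIBE DETAIL returns a single row containing the Delta table's unique id.
id_before = spark.sql(f"DESCRIBE DETAIL {OFFLINE_TABLE}").collect()[0]["id"]

fe_client.drop_table(name=OFFLINE_TABLE)
fe_client.create_table(
    name=OFFLINE_TABLE,
    primary_keys=["id"],
    df=spark.table("source_data"),
    description="Recreated table",
)

id_after = spark.sql(f"DESCRIBE DETAIL {OFFLINE_TABLE}").collect()[0]["id"]
print(id_before, id_after)  # different ids: anything captured at the original publish is now stale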
