[YAML]: add optional schema config for all transforms #35952

derrickaw · 2025-08-25T16:42:24Z

derrickaw · 2025-08-25T18:00:42Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a valuable feature by allowing an optional output_schema to be specified for any YAML transform. This enables schema validation and robust error handling across the pipeline. The implementation is well-structured, adding a Validate transform dynamically after a transform's execution. The logic correctly handles various scenarios, including schemaless PCollections, single and multiple outputs, and error handling configurations. The changes are supported by a comprehensive set of new tests and clear documentation. I have a few minor suggestions to improve the documentation and test file formatting.

sdks/python/apache_beam/yaml/tests/create.yaml

website/www/site/content/en/documentation/sdks/yaml-schema.md

github-actions · 2025-08-25T20:07:30Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

derrickaw · 2025-08-25T20:13:00Z

assign set of reviewers

github-actions · 2025-08-25T20:14:05Z

Assigning reviewers:

R: @jrmccluskey for label python.
R: @damccorm for label website.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

jrmccluskey

Largely LGTM, just one nit

jrmccluskey · 2025-08-26T15:12:13Z

sdks/python/apache_beam/yaml/yaml_transform.py

+  # the validation downstream.
+  clean_schema = SafeLineLoader.strip_metadata(spec)
+
+  def enforce_schema(pcoll, label, error_handling_spec):


Does this function need to be defined within expand_output_schema_transform()?

No, updated, thanks!

damccorm

Thanks, this is excellent - I had one broad comment, but overall this is a great change

damccorm · 2025-08-26T15:11:14Z

sdks/python/apache_beam/yaml/json_utils.py

        for (name, convert) in converters.items()
+        if (converted := convert(getattr(row, name, None))) is not None


When would this be None? What condition are we guarding against?

In theory I'd expect the previous conversion to be a bit more correct.

I was running through many scenarios during development and ran into an issue with this, but I don't think it's needed anymore. Good catch. Thanks.

Figured it out after the Validate_with_schema test failed following the revert :)
So there is a bug in that transform for null fields. The validator treats a row as failed if the schema declares a type for a field and that field's value is None. So we filter out the fields that are None and let the validator validate the rest of the row. For example:

BeamSchema_....(name='Bob', score=None, age=25)
During the conversion process to json -> {'name': 'Bob', 'score': None, 'age': 25}
Validation will fail on this row with this schema:
{'type': 'object', 'properties': {'name': {'type': 'string'}, 'age': {'type': 'integer'}, 'score': {'type': 'number'}}}

But if we convert that BeamRow to -> {'name': 'Bob', 'age': 25}
Then it passes fine.

My understanding of the code base is that we would have to update the jsonschema package to allow None, but that seems like a non-starter.
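
To make the None-filtering behavior discussed above concrete, here is a minimal Python sketch. It is not the code from `json_utils.py`: `Row`, `CONVERTERS`, `row_to_json`, and `matches_schema` are names invented for illustration, and `matches_schema` is a deliberately simplified stand-in for JSON Schema type checking, not a call into the `jsonschema` package. The schema and the example row are the ones quoted in the thread.

```python
from collections import namedtuple

# Hypothetical stand-in for the generated Beam row type from the thread.
Row = namedtuple("Row", ["name", "score", "age"])

# Identity converters, assumed here purely for illustration.
CONVERTERS = {name: (lambda value: value) for name in Row._fields}

# Schema quoted in the thread.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "score": {"type": "number"},
    },
}


def row_to_json(row, converters, drop_none=True):
    """Convert a row to a dict, optionally skipping fields whose value is None."""
    result = {}
    for name, convert in converters.items():
        converted = convert(getattr(row, name, None))
        if drop_none and converted is None:
            continue  # mirrors the `is not None` filter in the diff
        result[name] = converted
    return result


# Simplified stand-in for JSON Schema "type" checks (NOT the jsonschema package).
JSON_TYPES = {"string": str, "integer": int, "number": (int, float)}


def matches_schema(instance, schema):
    """Return True if every present property has the declared type."""
    for name, prop in schema.get("properties", {}).items():
        if name not in instance:
            continue  # absent properties pass because nothing is marked required
        value = instance[name]
        # bool is a subclass of int in Python but is not a JSON number.
        if isinstance(value, bool) or not isinstance(value, JSON_TYPES[prop["type"]]):
            return False
    return True


bob = Row(name="Bob", score=None, age=25)
with_none = row_to_json(bob, CONVERTERS, drop_none=False)
without_none = row_to_json(bob, CONVERTERS)
```

Under these assumptions, `with_none` keeps `score: None` and is rejected by the type check, while `without_none` omits `score` and passes, which matches the behavior described in the comment.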

Ok. I think this is fine - arguably both should error, but I think making all fields nullable is ok.

Eventually it might be good to add an optional: False field so that we can validate null fields as errors.

I am fine leaving as is for now though
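
As a rough illustration of the `optional: False` idea, the sketch below checks for null values before any None filtering happens. Everything here is hypothetical: the `optional` key is not part of JSON Schema, nothing in this thread says it is implemented in Beam, and `null_violations` is a name invented for this example.

```python
def null_violations(instance, schema):
    """Return the names of properties marked `optional: False` that are None or absent.

    The `optional` key is a hypothetical extension taken from the review
    comment; properties without it are treated as nullable.
    """
    return [
        name
        for name, prop in schema.get("properties", {}).items()
        if prop.get("optional", True) is False and instance.get(name) is None
    ]


# Same shape as the schema in the thread, with `score` marked non-optional.
strict_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "score": {"type": "number", "optional": False},
    },
}

bad = null_violations({"name": "Bob", "score": None, "age": 25}, strict_schema)
good = null_violations({"name": "Bob", "score": 9.5, "age": 25}, strict_schema)
```

A check like this would have to run on the unfiltered row, since dropping None fields first would hide exactly the case it is meant to catch.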

damccorm · 2025-08-26T15:18:13Z

sdks/python/apache_beam/yaml/tests/assign_timestamps.yaml

+              error_handling:
+                output: invalid_schema_rows


Haven't gotten to the point where we actually implement this, but it seems to me like it might be more useful to just have a single error_handling output associated with a transform where we capture all problematic records. In most cases, a user is going to want to write these somewhere for manual inspection/reprocessing and I don't think it matters too much whether it is because the output is an error or the output doesn't match the expected schema.

This could be done with a flatten step where you unify the error output from schema validation and normal exception handling.

I was thinking the user may want more control here, so they can easily tell up front whether a record failed because of a generic issue with the transform or because of a schema issue.

In practice, they're both probably going to be data issues which users want to pipe to some exception handling sink, so I think treating them as the same collection probably makes sense.

Will discuss with group also to finalize. Thanks.
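
The trade-off in this thread can be sketched in plain Python. This is an analogy only, not Beam code and not the design that was eventually merged: it combines two lists of error records into one collection, the way a flatten step would, and adds a `source` tag so records can still be filtered by cause. The function name, the record shape, and the `source` field are all assumptions made for the example.

```python
from itertools import chain


def merge_error_outputs(transform_errors, schema_errors):
    """Combine both kinds of bad records into one list, tagged by origin."""
    return list(
        chain(
            ({"source": "transform", **err} for err in transform_errors),
            ({"source": "schema", **err} for err in schema_errors),
        )
    )


merged = merge_error_outputs(
    [{"element": 1, "msg": "exception in transform"}],
    [{"element": 2, "msg": "row does not match output schema"}],
)
schema_only = [err for err in merged if err["source"] == "schema"]
```

With a tag like this, a single error sink receives every problematic record, and a user who cares about the distinction can still separate the two cases downstream.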

damccorm · 2025-08-26T15:29:21Z

website/www/site/content/en/documentation/sdks/yaml-schema.md

+          - {sdk: MillWheel, year: 2008}
+```
+
+However, a user will more likely want to detect and handle schema errors. This is where adding an `error_handling` configuration inside the `output_schema` comes into play. For example, the following code will


If we stay with the current approach, we should call out the difference between normal error handling and schema error handling here.

added another paragraph below, thanks

derrickaw · 2025-08-29T03:10:03Z

Offline decision is to revamp this design to only have one error_handling. Updates pending...

github-actions bot added python website yaml labels Aug 25, 2025

gemini-code-assist bot reviewed Aug 25, 2025

derrickaw force-pushed the optional_schema branch from 5b11920 to 52a4a8e Compare August 25, 2025 18:24

derrickaw marked this pull request as ready for review August 25, 2025 19:48

github-actions bot added the Next Action: Reviewers label Aug 25, 2025

jrmccluskey approved these changes Aug 26, 2025

damccorm reviewed Aug 26, 2025

derrickaw marked this pull request as draft August 29, 2025 03:10

derrickaw and others added 15 commits September 3, 2025 16:53

rebase - fix changes confilct

14277fc

fix some errors

815354d

add comment

ff2748a

fix whitespace

0fcde6f

Update website/www/site/content/en/documentation/sdks/yaml-schema.md

685637b

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update website/www/site/content/en/documentation/sdks/yaml-schema.md

4623803

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

address gemeni review comment and lint issue

e42f671

address comments

aa02f37

revert to filtering out None fields

ed7ef6d

update yaml files based on combined error_handling output

c1dcae6

update readmes and docs

c317379

update tests based on new logic

6f6fb4b

update code logic based on new output design

4e73f37

remove changes to text.yaml

7d5564b

remove old logic

4bd9abd

fix wording

65ef538

derrickaw force-pushed the optional_schema branch from 9e387bb to 65ef538 Compare September 3, 2025 16:56
