[YAML]: add optional schema config for all transforms #35952

derrickaw · 2025-08-25T16:42:24Z

derrickaw · 2025-08-25T18:00:42Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a valuable feature by allowing an optional output_schema to be specified for any YAML transform. This enables schema validation and robust error handling across the pipeline. The implementation is well-structured, adding a Validate transform dynamically after a transform's execution. The logic correctly handles various scenarios, including schemaless PCollections, single and multiple outputs, and error handling configurations. The changes are supported by a comprehensive set of new tests and clear documentation. I have a few minor suggestions to improve the documentation and test file formatting.

sdks/python/apache_beam/yaml/tests/create.yaml

website/www/site/content/en/documentation/sdks/yaml-schema.md

github-actions · 2025-08-25T20:07:30Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

derrickaw · 2025-08-25T20:13:00Z

assign set of reviewers

github-actions · 2025-08-25T20:14:05Z

Assigning reviewers:

R: @jrmccluskey for label python.
R: @damccorm for label website.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

jrmccluskey

Largely LGTM, just one nit

jrmccluskey · 2025-08-26T15:12:13Z

sdks/python/apache_beam/yaml/yaml_transform.py

+  # the validation downstream.
+  clean_schema = SafeLineLoader.strip_metadata(spec)
+
+  def enforce_schema(pcoll, label, error_handling_spec):


Does this function need to be defined within expand_output_schema_transform()?

no, updated, thanks!

damccorm

Thanks, this is excellent - I had one broad comment, but overall this is a great change

damccorm · 2025-08-26T15:11:14Z

sdks/python/apache_beam/yaml/json_utils.py

        for (name, convert) in converters.items()
+        if (converted := convert(getattr(row, name, None))) is not None


When would this be None? What condition are we guarding against?

In theory I'd expect the previous conversion to be a bit more correct.

I was running through many scenarios during development and ran into an issue with this, but I don't think it's needed anymore. Good catch. Thanks.

Figured it out after the Validate_with_schema test failed following the revert :)
There is a bug in that transform for null fields: the validator treats a row as failed if the schema declares a type for a field and that field is None. So we filter out the fields that are None and let the validator check the remaining fields of that row. For example:

BeamSchema_....(name='Bob', score=None, age=25)
During conversion to JSON this becomes {'name': 'Bob', 'score': None, 'age': 25}
Validation fails on this row with this schema:
{'type': 'object', 'properties': {'name': {'type': 'string'}, 'age': {'type': 'integer'}, 'score': {'type': 'number'}}}

But if we convert that Beam row to {'name': 'Bob', 'age': 25},
then it passes fine.

My understanding of the code base is that we would have to update the jsonschema package to allow None, but that seems like a non-starter.

Ok. I think this is fine - arguably both should error, but I think making all fields nullable is ok.

Eventually it might be good to add an optional: False field so that we can validate null fields as errors.

I am fine leaving as is for now though
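
The None-filtering behavior discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions: `Row`, `CONVERTERS`, `row_to_json`, and `failing_fields` are hypothetical names, the converters are identity functions, and `failing_fields` is a deliberately simplified stand-in for a JSON Schema type check (it is not the jsonschema package and not the code in json_utils.py).

```python
from collections import namedtuple

# Hypothetical stand-in for a Beam row; real rows are generated by Beam.
Row = namedtuple("Row", ["name", "score", "age"])

# Hypothetical identity converters, one per field.
CONVERTERS = {field: (lambda value: value) for field in Row._fields}

# Simplified mapping from JSON Schema type names to Python types.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float)}


def row_to_json(row, converters=CONVERTERS):
    """Convert a row to a dict, dropping fields whose converted value is None."""
    return {
        name: converted
        for (name, convert) in converters.items()
        if (converted := convert(getattr(row, name, None))) is not None
    }


def failing_fields(instance, schema):
    """Return names of present properties whose value does not match "type"."""
    failures = []
    for name, subschema in schema.get("properties", {}).items():
        if name not in instance:
            continue  # Absent properties are not type-checked here.
        value = instance[name]
        expected = JSON_TYPES[subschema["type"]]
        # bool is a subclass of int in Python but is not a JSON number.
        if isinstance(value, bool) or not isinstance(value, expected):
            failures.append(name)
    return failures


SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "score": {"type": "number"},
    },
}

bob = Row(name="Bob", score=None, age=25)
unfiltered = bob._asdict()   # keeps 'score': None
filtered = row_to_json(bob)  # drops the None-valued 'score'

print(failing_fields(unfiltered, SCHEMA))  # ['score']
print(failing_fields(filtered, SCHEMA))    # []
```

The trade-off matches the one raised above: dropping None-valued fields effectively makes every field nullable, so a null that should count as an error passes validation.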

damccorm · 2025-08-26T15:18:13Z

sdks/python/apache_beam/yaml/tests/assign_timestamps.yaml

+              error_handling:
+                output: invalid_schema_rows


Haven't gotten to the point where we actually implement this, but it seems to me like it might be more useful to just have a single error_handling output associated with a transform where we capture all problematic records. In most cases, a user is going to want to write these somewhere for manual inspection/reprocessing and I don't think it matters too much whether it is because the output is an error or the output doesn't match the expected schema.

This could be done with a flatten step where you unify the error output from schema validation and normal exception handling.

I was thinking the user may want more control over this and easily filter out the issue in the beginning based on generic issues with the transform versus schema issues.

In practice, they're both probably going to be data issues which users want to pipe to some exception handling sink, so I think treating them as the same collection probably makes sense.

Will discuss with group also to finalize. Thanks.

damccorm · 2025-08-26T15:29:21Z

website/www/site/content/en/documentation/sdks/yaml-schema.md

+          - {sdk: MillWheel, year: 2008}
+```
+
+However, a user will more likely want to detect and handle schema errors. This is where adding an `error_handling` configuration inside the `output_schema` comes into play. For example, the following code will


If we stay with the current approach, we should call out the difference between normal error handling and schema error handling here.

added another paragraph below, thanks

derrickaw · 2025-08-29T03:10:03Z

Offline decision is to revamp this design to only have one error_handling. Updates pending...

derrickaw · 2025-09-04T14:31:32Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a valuable feature: an optional output_schema configuration for all YAML transforms. This allows for schema validation on the output of any transform, enhancing data quality and pipeline robustness. The implementation is well-structured, handling various scenarios like single vs. multiple outputs and integrating seamlessly with existing error handling. The addition of new tests and comprehensive documentation, including a new yaml-schema.md page, is commendable. I've identified a couple of minor typos in user-facing text and have provided suggestions for correction. Overall, this is a solid contribution.

sdks/python/apache_beam/yaml/yaml_transform.py

website/www/site/content/en/documentation/sdks/yaml-schema.md

derrickaw · 2025-09-04T18:49:38Z

Run Python PreCommit 3.12

damccorm

Thanks! Some small remaining comments, but overall this is looking good

damccorm · 2025-09-05T20:50:53Z

sdks/python/apache_beam/yaml/yaml_transform.py

+    "error_handling config on a capable transform or the user can remove the" \
+    "output_schema config on this transform and add a ValidateWithSchema " \
+    "transform downstream of the current transform.")


Suggested change

-    "error_handling config on a capable transform or the user can remove the" \
-    "output_schema config on this transform and add a ValidateWithSchema " \
-    "transform downstream of the current transform.")
+    "error_handling config on a capable transform. Alternatively, you can remove the" \
+    "output_schema config on this transform and add a ValidateWithSchema " \
+    "transform with separate error handling downstream of the current transform.")

Nit for grammar/clarity

(may break the linter, so you might want to apply the change yourself rather than committing from the UI)

done, thanks
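
One detail worth checking when applying a change like this by hand: Python joins adjacent string literals with no separator, so a line ending in `the"` followed by a line starting with `"output_schema` produces `theoutput_schema`. A minimal sketch, using abbreviated stand-ins for the message text (the variable names are illustrative only):

```python
# Adjacent string literals are concatenated exactly as written;
# no space is inserted between them.
message = (
    "Alternatively, you can remove the"
    "output_schema config on this transform."
)
print("theoutput_schema" in message)  # True

# Ending the first literal with a space keeps the words separate.
fixed = (
    "Alternatively, you can remove the "
    "output_schema config on this transform."
)
print("the output_schema" in fixed)  # True
```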

damccorm · 2025-09-05T20:59:46Z

sdks/python/apache_beam/yaml/yaml_transform.py

+  # The transform produced outputs with many named PCollections and need to
+  # determine which PCollection should be validated on.
+  elif isinstance(outputs, dict):
+    main_output_key = _get_main_output_key(spec, outputs)


This behavior is a little surprising to me. If I have a PCollection which produces multiple non-error outputs, there are 3 possible behaviors I might expect:

1. Validate only the main output (what you're doing here)
2. Validate all non-error outputs (what I was expecting this to do)
3. Throw and tell the user they should use the ValidateTransform instead

I think any of the 3 are reasonable. I'd probably lean towards (3) because it protects the user from accidental bad behavior, but it is also less convenient. I'll defer to you on what you want to do here (leaving it is fine), but if we do leave it I would recommend:

- Making sure we clearly document this (I don't think it's called out right now, and I would document it no matter which behavior we choose)
- Explicitly testing this behavior
- Adding a warning explaining that only the main output will be validated when we can definitely tell there are multiple outputs (i.e., there are 3+ outputs, or 2 outputs with no error output)

I went with keeping it as is.

There is a doc string for the get_main_output_key method. Is that sufficient?

Done

There is a warning in that get_main_output_key method. Is that sufficient?

Thanks.
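
The warning condition suggested in this thread ("3+ outputs, or 2 outputs with no error output") can be sketched as below. This is a hypothetical helper, not the `_get_main_output_key` implementation from the PR; the function names, the `error_output_key` parameter, and the warning text are assumptions made for illustration.

```python
import warnings


def non_error_outputs(outputs, error_output_key=None):
    """Return the output names that are not the designated error output."""
    return [key for key in outputs if key != error_output_key]


def warn_if_multiple_outputs(outputs, error_output_key=None):
    """Warn when more than one non-error output exists.

    With at most one error output, "two or more non-error outputs" is the
    same condition as "3+ outputs, or 2 outputs with no error output".
    Returns True if a warning was issued.
    """
    candidates = non_error_outputs(outputs, error_output_key)
    if len(candidates) >= 2:
        warnings.warn(
            "Multiple non-error outputs found (%s); only the main output "
            "will be validated against output_schema." % ", ".join(candidates))
        return True
    return False


# Two outputs, one of which is the error output: no ambiguity.
print(warn_if_multiple_outputs({"good": 1, "errors": 2}, "errors"))  # False
# Two outputs and no error output: ambiguous, so warn.
print(warn_if_multiple_outputs({"evens": 1, "odds": 2}))  # True
```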

damccorm · 2025-09-05T21:00:40Z

sdks/python/apache_beam/yaml/yaml_transform.py

+            | f'FlattenErrors_{main_output_key}' >> beam.Flatten())
+      else:
+        # No error output in the original transform, so just add this one.
+        outputs[error_output_tag] = schema_error_pcoll


Is this possible?

Not anymore. Thanks.

codecov · 2025-09-06T16:33:03Z

Codecov Report

❌ Patch coverage is 77.21519% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.81%. Comparing base (35969b3) to head (1b54a9c).
⚠️ Report is 22 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| sdks/python/apache_beam/yaml/yaml_transform.py | 79.45% | 15 Missing ⚠️ |
| sdks/python/apache_beam/yaml/yaml_provider.py | 0.00% | 3 Missing ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #35952      +/-   ##
============================================
+ Coverage     56.79%   56.81%   +0.01%     
  Complexity     3385     3385              
============================================
  Files          1220     1220              
  Lines        185122   185201      +79     
  Branches       3508     3508              
============================================
+ Hits         105148   105216      +68     
- Misses        76649    76660      +11     
  Partials       3325     3325

| Flag | Coverage Δ |
|---|---|
| python | `81.00% <77.21%> (+<0.01%)` ⬆️ |

derrickaw · 2025-09-06T20:33:48Z

Run Python_Transforms PreCommit 3.10

github-actions bot added python website yaml labels Aug 25, 2025

gemini-code-assist bot reviewed Aug 25, 2025

sdks/python/apache_beam/yaml/tests/create.yaml (outdated, resolved)

website/www/site/content/en/documentation/sdks/yaml-schema.md (outdated, resolved)

website/www/site/content/en/documentation/sdks/yaml-schema.md (outdated, resolved)

derrickaw force-pushed the optional_schema branch from 5b11920 to 52a4a8e Compare August 25, 2025 18:24

derrickaw marked this pull request as ready for review August 25, 2025 19:48

github-actions bot added the Next Action: Reviewers label Aug 25, 2025

jrmccluskey approved these changes Aug 26, 2025

damccorm reviewed Aug 26, 2025

derrickaw marked this pull request as draft August 29, 2025 03:10

derrickaw force-pushed the optional_schema branch from 9e387bb to 65ef538 Compare September 3, 2025 16:56

derrickaw and others added 14 commits September 4, 2025 14:29

rebase - fix changes confilct

1f326fe

fix some errors

8bf8a96

add comment

08d2a73

fix whitespace

dd9cd98

Update website/www/site/content/en/documentation/sdks/yaml-schema.md

4cf017d

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update website/www/site/content/en/documentation/sdks/yaml-schema.md

4c5adb9

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

address gemeni review comment and lint issue

52dfffc

address comments

2986eb1

revert to filtering out None fields

a451209

update yaml files based on combined error_handling output

c3d16f9

update readmes and docs

5cd804d

update tests based on new logic

f3778ab

update code logic based on new output design

57c8930

remove changes to text.yaml

c8d0eee

derrickaw added 2 commits September 4, 2025 14:29

remove old logic

a0416da

fix wording

e793e65

derrickaw force-pushed the optional_schema branch from 65ef538 to e793e65 Compare September 4, 2025 14:30

gemini-code-assist bot reviewed Sep 4, 2025

sdks/python/apache_beam/yaml/yaml_transform.py (outdated, resolved)

website/www/site/content/en/documentation/sdks/yaml-schema.md (outdated, resolved)

derrickaw added 2 commits September 4, 2025 14:48

update comments etc

4cd35e9

fix github action test

b1c1525

derrickaw marked this pull request as ready for review September 4, 2025 20:28

derrickaw requested a review from damccorm September 5, 2025 20:32

damccorm reviewed Sep 5, 2025

derrickaw added 2 commits September 6, 2025 15:06

add unit test for get_main_output_key

51e89d4

update comments and logic

2e55000

add another yaml transform test to cover code change

1b54a9c
