Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OrcSchemaConversionValidator to avoid infinite recursion in AvroOrcSchemaConverter.getOrcSchema() #3794

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wsarecv
Copy link
Contributor

@wsarecv wsarecv commented Oct 5, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
  • The issue to solve is the endless recursion when creating a GobblinOrcWriter instance, during which the recursive call for AvroOrcSchemaConverter.getOrcSchema() never ends and causes StackOverflowError.
  • This PR introduces OrcSchemaConversionValidator which can skip the topic that can cause the above issue.
  • This PR also enhances the TopicValidatorsTest to cover the case: when validation throws exception, a topic should be always treated as "valid".

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
  • added OrcSchemaConversionValidatorTest

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link

codecov-commenter commented Oct 5, 2023

Codecov Report

Merging #3794 (09f986a) into master (028b85f) will increase coverage by 0.06%.
Report is 2 commits behind head on master.
The diff coverage is 87.71%.

@@             Coverage Diff              @@
##             master    #3794      +/-   ##
============================================
+ Coverage     47.32%   47.39%   +0.06%     
- Complexity    10955    10989      +34     
============================================
  Files          2152     2156       +4     
  Lines         85114    85179      +65     
  Branches       9451     9469      +18     
============================================
+ Hits          40279    40368      +89     
+ Misses        41184    41154      -30     
- Partials       3651     3657       +6     
Files Coverage Δ
.../kafka/validator/OrcSchemaConversionValidator.java 100.00% <100.00%> (ø)
...or/extract/kafka/validator/TopicNameValidator.java 100.00% <100.00%> (ø)
...or/extract/kafka/validator/TopicValidatorBase.java 100.00% <100.00%> (ø)
...in/source/extractor/extract/kafka/KafkaSource.java 6.33% <50.00%> (+0.20%) ⬆️
...actor/extract/kafka/validator/TopicValidators.java 92.00% <92.00%> (ø)
...pache/gobblin/util/orc/AvroOrcSchemaConverter.java 56.86% <50.00%> (-1.48%) ⬇️

... and 19 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@wsarecv wsarecv changed the title Add OrcSchemaConversionValidator [GOBBLIN-1928] Add OrcSchemaConversionValidator to avoid infinite recursion in AvroOrcSchemaConverter.getOrcSchema() Oct 10, 2023
@wsarecv wsarecv changed the title [GOBBLIN-1928] Add OrcSchemaConversionValidator to avoid infinite recursion in AvroOrcSchemaConverter.getOrcSchema() [GOBBLIN-1928] (will rebase once #3793 is commited)Add OrcSchemaConversionValidator to avoid infinite recursion in AvroOrcSchemaConverter.getOrcSchema() Oct 11, 2023
@wsarecv wsarecv force-pushed the tqin/schemaValidator branch 2 times, most recently from adbc308 to e0bd51d Compare October 18, 2023 22:09
@wsarecv wsarecv changed the title [GOBBLIN-1928] (will rebase once #3793 is commited)Add OrcSchemaConversionValidator to avoid infinite recursion in AvroOrcSchemaConverter.getOrcSchema() Add OrcSchemaConversionValidator to avoid infinite recursion in AvroOrcSchemaConverter.getOrcSchema() Oct 18, 2023
@wsarecv wsarecv marked this pull request as ready for review October 18, 2023 22:30
@wsarecv wsarecv force-pushed the tqin/schemaValidator branch from 19bae2d to 55b8951 Compare October 18, 2023 22:38
// Try converting the avro schema to orc schema to check if any errors.
int maxRecursiveDepth = this.state.getPropAsInt(MAX_RECURSIVE_DEPTH_KEY, DEFAULT_MAX_RECURSIVE_DEPTH);
AvroOrcSchemaConverter.tryGetOrcSchema(schema, 0, maxRecursiveDepth);
} catch (StackOverflowError e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you plan to catch the StackOverflowError anyway, why do we even need to introduce the depth here?

public boolean validate(KafkaTopic topic) throws Exception {
LOGGER.debug("Validating ORC schema conversion for topic {}", topic.getName());
try {
Schema schema = (Schema) this.schemaRegistry.getLatestSchema(topic.getName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema from schema registry is not the final schema, especially we do have converter to change the schema there and some times remove the recursive schema to enable ingestion, so I believe we should use converted schema to do validation here

int maxRecursiveDepth = this.state.getPropAsInt(MAX_RECURSIVE_DEPTH_KEY, DEFAULT_MAX_RECURSIVE_DEPTH);
AvroOrcSchemaConverter.tryGetOrcSchema(schema, 0, maxRecursiveDepth);
} catch (StackOverflowError e) {
LOGGER.warn("Failed to covert latest schema to ORC schema for topic: {}", topic.getName());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's possible to detect which filed is recursive and include in this error message? so it will be easier for us to find the problematic filed and re-enable the ingestion when necessary

AvroOrcSchemaConverter.tryGetOrcSchema(schema, 0, maxRecursiveDepth);
} catch (StackOverflowError e) {
LOGGER.warn("Failed to covert latest schema to ORC schema for topic: {}", topic.getName());
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should emit some event to indicate this kind of error to avoid silent failure here. Same for other validator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants