[FLINK-37192] [pyflink] Replace deprecated avro-python3 with avro #26008

@mina-asham
Contributor

mina-asham commented Jan 17, 2025

What is the purpose of the change

Replace deprecated avro-python3 with avro

  • avro-python3 was deprecated and replaced by avro. The former has not had a release since March 17, 2021, while the latter has received multiple fixes and updates, the latest on August 5, 2024.
  • The two libraries are the same project (avro-python3 was simply renamed to avro and has been updated several times since), but they overlap in package name. As a result, using PyFlink together with any up-to-date library that relies on avro fails when the pipeline starts, even if the pipeline does no Avro encoding or decoding at all.
  • This change updates the dependency to the latest release and fixes a few imports; apart from that, the library's functionality is exactly the same.
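
To illustrate the overlap described in the list above, here is a minimal sketch of how one might spot two installed distributions that ship the same top-level import name. The helper name `find_shared_import_names` and the example mapping are hypothetical and are not part of PyFlink or of this change; the mapping only mimics the shape returned by `importlib.metadata.packages_distributions()`.

```python
def find_shared_import_names(packages_to_dists):
    """Return import names that more than one distribution provides.

    `packages_to_dists` maps a top-level import name to the list of
    distributions that ship it, in the same shape that
    importlib.metadata.packages_distributions() returns (Python 3.10+).
    """
    return {
        name: sorted(set(dists))
        for name, dists in packages_to_dists.items()
        if len(set(dists)) > 1
    }


# Hypothetical environment in which both distributions ended up installed.
example = {
    "avro": ["avro-python3", "avro"],
    "py4j": ["py4j"],
}
conflicts = find_shared_import_names(example)
print(conflicts)  # {'avro': ['avro', 'avro-python3']}

# Against a real environment (assumes Python 3.10 or newer):
# from importlib.metadata import packages_distributions
# print(find_shared_import_names(packages_distributions()))
```

An empty result would suggest that no two installed distributions compete for the same import name, at least as far as the installed metadata reports it.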

Brief change log

  • Replace avro-python3 (deprecated since 2021) with avro in PyFlink

Verifying this change

This change is already covered by existing tests, such as the existing Avro tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@mina-asham mina-asham force-pushed the minaasham/upgrade-replace-avro-python3 branch from 7c11f44 to 8c67bb7 Compare January 17, 2025 14:13
@flinkbot
Collaborator

flinkbot commented Jan 17, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

@dianfu
Contributor

dianfu commented Jan 21, 2025

@mina-asham Thanks for the PR! LGTM overall. Could you create a JIRA ticket and also rebase the PR to fix the code conflict of file setup.py?

@mina-asham mina-asham force-pushed the minaasham/upgrade-replace-avro-python3 branch 2 times, most recently from 8d2543b to 84398d2 Compare January 21, 2025 09:26
@mina-asham
Contributor Author

mina-asham commented Jan 21, 2025

@dianfu thanks for reviewing!

> Could you create a JIRA ticket

I am still waiting for my JIRA account to be created; is it possible to help get this approved? I applied last Thursday (16th of Jan).
Update: my account was just created. I created the issue here: https://issues.apache.org/jira/browse/FLINK-37192 and updated the PR.

> also rebase the PR to fix the code conflict of file setup.py?

✅ Done

@mina-asham mina-asham force-pushed the minaasham/upgrade-replace-avro-python3 branch from 84398d2 to 6239582 Compare January 21, 2025 12:35
@mina-asham mina-asham changed the title [FLINK-TODO] [pyflink] Replace deprecated avro-python3 with avro [FLINK-37192] [pyflink] Replace deprecated avro-python3 with avro Jan 21, 2025
@mina-asham
Contributor Author

@flinkbot run azure

@dianfu
Contributor

dianfu commented Jan 23, 2025

@mina-asham Thanks for the update! These are test failures, could you take a look?

@mina-asham mina-asham force-pushed the minaasham/upgrade-replace-avro-python3 branch from 6239582 to f93d04a Compare January 23, 2025 16:17
@mina-asham
Contributor Author

@flinkbot run azure

@mina-asham
Contributor Author

> @mina-asham Thanks for the update! These are test failures, could you take a look?

@dianfu fixed, thanks for catching this

@@ -318,7 +318,7 @@ def extracted_output_files(base_dir, file_path, output_directory):

 install_requires = ['py4j==0.10.9.7', 'python-dateutil>=2.8.0,<3',
                     'apache-beam>=2.54.0,<=2.61.0',
-                    'cloudpickle>=2.2.0', 'avro-python3>=1.8.1,!=1.9.2',
+                    'cloudpickle>=2.2.0', 'avro>=1.12.0',
Contributor

There seem to be slight differences between the old avro-python3 package and the new, preferred avro package. I assume the other changes in this PR relate to those differences. The original code seems to tolerate a range of versions; do we know why? Or is the latest fine?

I am curious whether fastavro needs to be updated to be compatible with avro 1.12.0 in any way. fastavro 1.1 is 5 years old; should we move this to the latest as well?

Contributor Author

> There seem to be slight differences between the old avro-python3 package and the new, preferred avro package. I assume the other changes in this PR relate to those differences. The original code seems to tolerate a range of versions; do we know why? Or is the latest fine?

I think we are fine. avro-python3 was deprecated in favour of avro; it is the same package, just renamed. The updates since then did include some breaking changes (changed package names, slightly changed identifiers, etc.), and that is what the code changes here account for. We still accept a range of versions here, which AFAIK is a Python best practice so that consumers are not locked into a specific library version.

> I am curious whether fastavro needs to be updated to be compatible with avro 1.12.0 in any way. fastavro 1.1 is 5 years old; should we move this to the latest as well?

fastavro is completely unrelated (beats me why Flink is using two separate Avro libraries tbh). avro-python3/avro is a pure-Python Avro implementation, while fastavro is written in C to offer faster execution. It might be worth using a single Python package for Avro or upgrading avro, but I think that is a bigger effort and should be out of scope for this smaller change.
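
On the version-range point above, a rough sketch of what the old and new specifiers admit may help. It uses a deliberately tiny, hand-rolled checker (`parse_version` and `satisfies` are hypothetical helpers, not pip's or PyFlink's logic) that only understands `>=` and `!=` clauses on plain dotted integers with the same number of components; a real tool follows the full PEP 440 rules.

```python
def parse_version(text):
    """Turn '1.12.0' into (1, 12, 0). Handles plain dotted integers only."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, specifier):
    """Check a version against comma-separated '>=' and '!=' clauses.

    A small stand-in for a real PEP 440 implementation, for illustration.
    """
    candidate = parse_version(version)
    for clause in specifier.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            if candidate < parse_version(clause[2:]):
                return False
        elif clause.startswith("!="):
            if candidate == parse_version(clause[2:]):
                return False
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


OLD_SPEC = ">=1.8.1,!=1.9.2"  # previously applied to avro-python3
NEW_SPEC = ">=1.12.0"         # now applied to avro

print(satisfies("1.9.2", OLD_SPEC))   # False: explicitly excluded
print(satisfies("1.10.2", OLD_SPEC))  # True
print(satisfies("1.11.3", NEW_SPEC))  # False: below the new lower bound
print(satisfies("1.12.0", NEW_SPEC))  # True
```

The new specifier is a lower bound rather than a pin, so any later release would also satisfy it.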

@davidradl
Contributor

Reviewed by Chi on 23/01/2025. Go back to the submitter with review comments.

@mina-asham mina-asham requested a review from davidradl January 24, 2025 16:32
@mina-asham
Contributor Author

> Reviewed by Chi on 23/01/2025. Go back to the submitter with review comments.

@davidradl thanks for reviewing, responded and re-requested review.

@dianfu
Contributor

dianfu commented Jan 28, 2025

@mina-asham Thanks for the update. LGTM.

@dianfu
Contributor

dianfu commented Jan 28, 2025

Will wait to see if @davidradl has other comments before merging the PR.

@dianfu
Contributor

dianfu commented Feb 6, 2025

Merged to master via aef8c86

@dianfu dianfu closed this Feb 6, 2025
dianfu pushed a commit that referenced this pull request Feb 6, 2025
- avro-python3 was deprecated and replaced by avro. The former has not had a release since March 17, 2021, while the latter has received multiple fixes and updates, the latest on August 5, 2024.
- The two libraries are the same project (`avro-python3` was simply renamed to `avro` and has been updated several times since), but they overlap in package name. As a result, using PyFlink together with any up-to-date library that relies on `avro` fails when the pipeline starts, even if the pipeline does no Avro encoding or decoding at all.
- This change updates the dependency to the latest release and fixes a few imports; apart from that, the library's functionality is exactly the same.

This closes #26008.
@iamharbie

Could anyone help clarify which version of PyFlink includes this change? Both of the recently released versions, 1.19.2 and 1.20.1, still depend on the old avro-python3.

@mina-asham
Contributor Author

> Could anyone help clarify which version of PyFlink includes this change? Both of the recently released versions, 1.19.2 and 1.20.1, still depend on the old avro-python3.

I think this will be available in the upcoming 2.X releases, but it hasn't been backported to 1.X AFAIK.
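
For anyone who wants to check a given installation, one possible approach is to look at the requirement strings recorded in the installed package metadata. The sketch below is an assumption-laden illustration: `avro_requirements` is a hypothetical helper, and the sample requirement lists are made up for the example rather than copied from any released PyFlink version.

```python
import re


def avro_requirements(requirement_strings):
    """Pick out requirement strings whose project name is avro or avro-python3."""
    wanted = {"avro", "avro-python3"}
    matches = []
    for requirement in requirement_strings or []:
        # The project name is the leading run of name characters.
        name = re.match(r"[A-Za-z0-9._-]+", requirement.strip())
        if name and name.group(0).lower() in wanted:
            matches.append(requirement)
    return matches


# Illustrative requirement lists (not taken from real release metadata).
before_change = ["py4j==0.10.9.7", "avro-python3>=1.8.1,!=1.9.2", "fastavro>=1.1.0"]
after_change = ["py4j==0.10.9.7", "avro>=1.12.0", "fastavro>=1.1.0"]

print(avro_requirements(before_change))  # ['avro-python3>=1.8.1,!=1.9.2']
print(avro_requirements(after_change))   # ['avro>=1.12.0']

# Against an installed PyFlink (assumes the distribution name apache-flink):
# from importlib.metadata import requires
# print(avro_requirements(requires("apache-flink")))
```

Note that fastavro is intentionally not matched, since it is a separate project from avro and avro-python3.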
