Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⬆️ [email protected] #722

Merged
merged 42 commits into from
Apr 19, 2022
Merged

⬆️ [email protected] #722

merged 42 commits into from
Apr 19, 2022

Conversation

CNugteren
Copy link
Contributor

@CNugteren CNugteren commented Apr 1, 2022

What do these changes do?

This PR updates LCE to TensorFlow 2.8.0. Quite a few changes have been made, and most likely more are needed. I'll highlight some of the more controversial changes below in the form of a comment.

How Has This Been Tested?

All unittests have been ran locally. Fixes have been made as compilation, linking, or testing errors showed-up. However, some of them do not succeed yet, most notably the MLIR tests. This PR is therefore still a draft.

Benchmark Results

Latency has not been verified yet.

Related issue number

N/A.

Tasks

Open tasks before being able to turn the draft PR in an actual PR:

  • Resolve the failing MLIR tests
  • Deal with the deprecated make build system, translate to CMake?

@CNugteren CNugteren requested review from Tombana and lgeiger April 1, 2022 14:29
@CNugteren CNugteren added the feature New feature or request label Apr 1, 2022
Comment on lines +81 to +86
Operation* LarqDialect::materializeConstant(OpBuilder& builder, Attribute value,
Type type, Location loc) {
if (arith::ConstantOp::isBuildableWith(value, type))
return builder.create<mlir::arith::ConstantOp>(loc, type, value);
return nullptr;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required to enable constant folding with the switch to arith.ConstantOp. TensorFlow does something similar as well. See here for more information about this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the tensorflow example they first try to make it into a TFL::ConstOp before they try the arith::ConstOp. I guess we don't need that here because the larq dialect is used for TF ops and not TFL ops?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the tensorflow example they first try to make it into a TFL::ConstOp before they try the arith::ConstOp. I guess we don't need that here because the larq dialect is used for TF ops and not TFL ops?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand their code correctly it will create a tfl.pseudo_const if the value is opaque which isn't supported by arith::ConstantOp. I don't think this is a case that we will encounter. In general TFL::ConstOp will always be folded to arith::ConstantOp if possible so I think this should be fine to keep it like this for now.

lgeiger and others added 2 commits April 8, 2022 19:33
* Update converter to sync with upstream tensorflow

* Update larq_compute_engine/mlir/tf_to_tfl_flatbuffer.cc

Co-authored-by: Tom Bannink <[email protected]>

Co-authored-by: Tom Bannink <[email protected]>
@CNugteren
Copy link
Contributor Author

I've removed the old make-builds, and replaced it with the new CMake build system. I've only done something quite basic, building the library and the example and benchmark binaries. I tested the benchmark binary locally and that seems to work. So I think we are ready for a last round of reviews now.

@CNugteren CNugteren marked this pull request as ready for review April 11, 2022 16:00
@lgeiger
Copy link
Member

lgeiger commented Apr 11, 2022

I've removed the old make-builds, and replaced it with the new CMake build system. I've only done something quite basic, building the library and the example and benchmark binaries. I tested the benchmark binary locally and that seems to work. So I think we are ready for a last round of reviews now.

Awesome! 🚀

Just for reference, so we don't forget. Once this is merged and we've released a new version we will also need to update build docs.

@lgeiger lgeiger mentioned this pull request Apr 12, 2022
@lgeiger
Copy link
Member

lgeiger commented Apr 13, 2022

I triggered a test release, let's see how this goes.

@Tombana
Copy link
Collaborator

Tombana commented Apr 13, 2022

I triggered a test release, let's see how this goes.

The windows looks like it succeeded within 12 minutes but actually the build failed with

The target you are compiling requires Visual C++ build tools. 
Bazel couldn't find a valid Visual C++ build tools installation on your machine.

The linux builds fail because the manylinux toolchain bazel files can't be found; they've probably been updated to newer versions.

@lgeiger
Copy link
Member

lgeiger commented Apr 13, 2022

The windows looks like it succeeded within 12 minutes but actually the build failed with

I pinned the Windows VM version to 2019 in 5bf5748, let's see if this fixes the issue. Alternatively we could also change the BAZEL_VC path to make it work with Windows 2022, but not sure if we would run into other issues then.

The linux builds fail because the manylinux toolchain bazel files can't be found; they've probably been updated to newer versions.

I pinned the docker image to 2.8 in 3923692 which didn't seem to fix things.

It seems like the TensorFlow repo removed their toolchains from the repo. I copied the toolchain from the TF Addons repo in bdb6f20. They use the same docker container to build the wheels as we do, so I expect this to work.

I triggered a new build to test these changes.

@CNugteren
Copy link
Contributor Author

CNugteren commented Apr 15, 2022

Two builds failed:

Update: the linux build fails with:

ERROR: /root/.cache/bazel/_bazel_root/db801a7aca49c33ba961fa7cde3e7bf7/external/org_tensorflow/tensorflow/compiler/mlir/tensorflow/BUILD:573:11: Compiling tensorflow/compiler/mlir/tensorflow/ir/tf_ops.cc failed: (Exit 4): crosstool_wrapper_driver_is_not_gcc failed: error executing command 
(...)
gcc: internal compiler error: Killed (program cc1plus)

But in the latest build it is the 3.8 version of the linux build that fails and the 3.9 passes, so that seems random:
https://github.com/larq/compute-engine/runs/6036809834?check_suite_focus=true

@CNugteren
Copy link
Contributor Author

The timeout increase did not work, it still times out after 6 hours: https://github.com/larq/compute-engine/runs/6036809795?check_suite_focus=true. Perhaps that is some global setting or a Github Actions maximum? In any case, this time two out of the 3 Windows builds passed since they finished in five and a half hours, only one timed out. This behaviour was also seen earlier without TF 2.8, it was always tight with this 6 hours maximum. So I do not consider this related to this branch.

My proposal is to merge this in and then investigate making the releases more stable in a separate PR (Linux builds random failures and Windows builds time outs)? I can open an issue for that.

@Tombana
Copy link
Collaborator

Tombana commented Apr 19, 2022

I agree, let's merge it.

I think I've seen this before and in that particular case the system had run out of memory. That could be fixed by adding swap space or changing the number of jobs that bazel runs in parallel. But maybe this problem is different.

gcc: internal compiler error: Killed (program cc1plus)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants