Add native RISC-V nodes to OMR CI testing #7530

Closed
0xdaryl opened this issue Nov 6, 2024 · 17 comments

@0xdaryl
Contributor

0xdaryl commented Nov 6, 2024

Eclipse now provides limited access to native RISC-V nodes, upon request. Details here [1].

We will have to evaluate whether these are suitable for native builds and test, or just as test nodes.

[1] https://github.com/eclipse-cbi/cbi/wiki#whats-provided

@janvrany @AdamBrousseau @jdekonin FYI

0xdaryl added the ci label Nov 6, 2024
@AdamBrousseau
Contributor

https://github.com/eclipse-cbi/jiro/wiki/Dedicated-build-agents

Riscv64 servers based on VisionFive2 SoC boards, with 8 GB RAM, 4 cores, and 960 GB of NVMe SSD storage. At this point, we estimate that each machine can host up to 4 containers with oversubscription.

Containers are delivered with latest (at the moment of container creation) Ubuntu https://hub.docker.com/r/riscv64/ubuntu/ and with the following tooling:

Temurin JDK 21 LTS https://adoptium.net/en-GB/temurin/releases/?arch=riscv64&version=21
Maven 3.9.9
Ant 1.10.5
Additional packages installed:

build-essential
libboost-all-dev
libssl-dev
libgtk-3-dev
libglu1-mesa-dev

This will be mostly on a first-come-first-serve basis. If more projects need more compute time on riscv64, we will need to delegate projects to cloud services like Scaleway at some point.
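
For a rough sense of scale, the node description above implies a per-container memory budget. This is a back-of-the-envelope sketch only: it assumes memory is split evenly across containers and treats 1 GB as 1024 MiB, neither of which the wiki states, and the function name is made up for illustration.

```shell
# Even-split memory budget per container, in MiB.
# Inputs come from the quoted node description: 8 GB RAM, up to 4 containers.
# The even split is an assumption; oversubscribed containers may not be capped this way.
per_container_mib() {
  total_gb="$1"
  containers="$2"
  echo $(( total_gb * 1024 / containers ))
}

per_container_mib 8 4
```

Under those assumptions a fully loaded node would leave about 2 GB per container, which is worth keeping in mind when judging whether the nodes suit native builds as well as tests.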

Generally we use the openj9 playbooks to set up omr machines, because they cover a superset of the tools needed for OMR builds. Is the minimal set of required tools documented somewhere? Perhaps we can craft our own container that Eclipse will host for us. Alternatively, we can install the necessary tooling in the container on the fly, but that will require a bit of build work to get going.

For the record, we have riscv builds that run in qemu(?) on x64 Linux. They have been disabled since May 2024. I cannot recall why exactly. It looks like the two machines that can run them were offline for a while. When the builds were running, they took about an hour to complete. I have kicked off a build to see if it still works.

https://ci.eclipse.org/omr/job/Build-linux_riscv64_cross/494/

@0xdaryl
Contributor Author

0xdaryl commented Nov 6, 2024

I cannot recall why exactly.

They were consuming a lot of memory and were falling over intermittently. I thought there was an issue created for this, but I can't find it at the moment. As I recall, the harness running the OMR compiler during tests leaks memory between compilations. While there are hacks to work around this, as we discussed elsewhere, the preferred fix is to understand all the holes and plug them.

@janvrany
Contributor

janvrany commented Nov 6, 2024

These builds were disabled because they were unreliable. The exact cause was never found, but my suspicion is excessive memory consumption caused by some interaction between OMR's code cache and QEMU's TCG.

@janvrany
Contributor

janvrany commented Nov 6, 2024

Here's what I installed on my RISC-V machines in order to compile and test both OMR and OpenJ9:

https://github.com/janvrany/debian-for-toys/blob/master/common/mk-fs-hooks/customize50-dev-tools.sh

But the list above is not minimal. @jdekonin wrote a Dockerfile for building the cross-compilation environment that was used in the now-disabled builds. Here's the relevant bit:

https://github.com/eclipse-omr/omr/blob/master/buildenv/docker/riscv64/debian11/Dockerfile#L81

Note that you need riscv.h and riscv-opc.h:

https://github.com/janvrany/debian-for-toys/blob/master/common/mk-fs-hooks/customize50-dev-tools.sh#L37-L38
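
A build job could check for those two headers up front instead of failing partway through compilation. The following is only a sketch: the function name is invented for illustration, and the directory is left as a parameter because the thread does not say where the headers should be installed.

```shell
# check_riscv_headers DIR: succeed only if riscv.h and riscv-opc.h both exist in DIR.
# DIR is a placeholder; the install location is not given in the thread.
check_riscv_headers() {
  dir="$1"
  missing=0
  for header in riscv.h riscv-opc.h; do
    if [ ! -f "${dir}/${header}" ]; then
      # Report each missing header on stderr so all gaps are visible at once.
      echo "missing: ${dir}/${header}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Calling it early in a job (with whatever include directory the node actually uses) would turn a missing header into an immediate, readable failure.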

@AdamBrousseau
Contributor

Do we expect real hardware to be more reliable then since it won't be using qemu?

@AdamBrousseau
Contributor

@janvrany
Copy link
Contributor

janvrany commented Nov 7, 2024

Do we expect real hardware to be more reliable then since it won't be using qemu?

I would, but I never had problems with QEMU either (but the machine had/has 16GB RAM).

Speaking of RISC-V hardware, I have had the exact same board for some time and never had that kind of problem (I was running jobs on it after the "old" job was disabled, just to get a feeling for whether things were still okay). I'm not running it now because it is consistently failing because of Python; see #7496 and #7279. Once that is resolved, I'll enable it again and can watch it more closely if that helps.

@janvrany
Contributor

It seems that a RISC-V node is connected: https://ci.eclipse.org/omr/computer/riscv-build2/

Is there anything I can do to make use of it?

@AdamBrousseau
Contributor

We likely want to add a new spec to the Jenkins file that isn't *cross:

'linux_riscv64_cross' : [

I can add new builds to Jenkins.
The new spec won't run in Docker, so don't add it to this line:
https://github.com/eclipse-omr/omr/blob/ef89b64bcaf398492b7c748135b226ccbf76940b/buildenv/jenkins/omrbuild.groovy#L47C2-L47C56
For the machine label, I would use hw.arch.riscv64.

@janvrany
Contributor

@AdamBrousseau I have added the spec as suggested, see #7556. I cannot really test it though (or do not know how).

@janvrany
Contributor

janvrany commented Dec 2, 2024

@janvrany
Contributor

janvrany commented Jan 6, 2025

Both #7556 and #7576 are merged so RISC-V native builds should be working now.

Should the https://ci.eclipse.org/omr/job/Build-linux_riscv64/ job be enabled now?

@AdamBrousseau
Contributor

Slow but it passed
https://ci.eclipse.org/omr/job/Build-linux_riscv64/1/
Thanks for all your efforts on this @janvrany!

@janvrany
Contributor

janvrany commented Jan 6, 2025

Yeah, compilation seems slow - looking at my system, compilation takes ~10 mins (no ccache, storage over NFS).

Maybe ccache would help, but that would need more work. I noticed that ccache is actually not used on any of the cmake-based builds; I'll open an issue tomorrow.
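
On the ccache idea: CMake can route compiler invocations through a launcher via the CMAKE_C_COMPILER_LAUNCHER and CMAKE_CXX_COMPILER_LAUNCHER variables. A minimal sketch of what that could look like is below. The helper name `launcher_args` and the `build` directory are illustrative and not taken from the OMR build scripts, so the eventual fix may look quite different.

```shell
# launcher_args LAUNCHER: print the CMake options that wrap every C and C++
# compiler call in LAUNCHER (for example ccache). The helper name is hypothetical.
launcher_args() {
  launcher="$1"
  printf '%s\n' \
    "-DCMAKE_C_COMPILER_LAUNCHER=${launcher}" \
    "-DCMAKE_CXX_COMPILER_LAUNCHER=${launcher}"
}

# Example use (commented out; assumes cmake and ccache are installed and that
# "build" is the build directory):
#   cmake -S . -B build $(launcher_args ccache)
#   cmake --build build
#   ccache -s   # print cache hit/miss statistics
```

A cache only pays off on repeat builds, so the first run on a fresh node would be expected to stay as slow as it is now.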

@janvrany
Contributor

janvrany commented Jan 7, 2025

I'll open an issue tomorrow.

Done: #7599.

Anything else or can we close this issue?

@AdamBrousseau
Contributor

I think this can be closed. @0xdaryl

@0xdaryl
Contributor Author

0xdaryl commented Jan 9, 2025

Thanks for getting this working @janvrany!

0xdaryl closed this as completed Jan 9, 2025