title: 2022-11-09
parent: Meetings

ASWF CI Working Group

Meeting: 09 November 2022

https://zoom-lfx.platform.linuxfoundation.org/meeting/97730566101

Attendees

  • Jean-Francois Panisset (VES Technology Committee)
  • Jean-Christophe Morin
  • Larry Gritz (Sony Imageworks)
  • Andrew Grimberg (LF Release Engineering)
  • Ryan Bottriell (ILM)
  • Jeff Bradley (DreamWorks)
  • Viven Lyer
  • Christina Tempelaar-Lietz (ILM)
  • Esteban Papp (AWS)
  • Aloys Baillet (NVIDIA)
  • John Mertic (LF)

Apologies

New items

  • Update on GitHub Actions paid runners (Andrew)
    • Available GPU builders:
      • ubuntu-20.04-gpu-6c-112g-336h-16vr
      • ubuntu-20.04-gpu-12c-224g-672h-32vr
      • ubuntu-20.04-gpu-24c-448g-1344h-64vr
    • Specifiers are:
      • Xc == number of CPUs
      • Xg == GB of RAM
      • Xh == GB of disk
      • Xvr == GB of VRAM
    • Each of these uses NVIDIA Tesla V100 GPUs, in 1x, 2x, and 4x configurations (each GPU contributes 16vr, so for example the 64vr pool has four GPUs). Like the other runner pools, these are capped at a maximum of 2 concurrent jobs at any time.
    • Costs so far? The last information we had is that usage is free until the end of November; the GPU runners still don't have a cost associated with them. We did get information on why we saw build minutes past 6 hrs for some projects: GitHub only reports the number of minutes per project per day, not how many times a workflow runs, unless we look at the project logs.
    • Even on custom runners, maximum job time is 6hrs.
    • OpenVDB 10.0 release: "The release process is much smoother this time around than last time, especially because the CI is running a lot faster than last year." Useful for justifying the cost when we start having to pay real money.
    • When does the free period expire? Still Nov 30th?
    • GPU runners? Are nvidia-smi and other tools available inside build containers? Aloys: we use NVIDIA-provided containers as a base, so I will check if nvidia-smi is in there. Larry: this was in the context of OSL builds. A trivial issue was that the ASWF containers don't have any of the driver components in them. The blocker is that if we don't use the containers (I don't use them for some of my builds), runners may not have permission to install some of the dependencies. Aloys: the ASWF Docker images are based on cudagl images from NVIDIA, which have a built-in symlink to the host driver. If you run with the NVIDIA "execution driver", you have to install a Docker extension to communicate with the GPU. I think it's documented on the ASWF Docker images. You can't install the NVIDIA driver within the Docker container / image, but inside the container there's enough infrastructure to symlink to the host. Larry: it may not have been the driver per se, but there were some libraries and utilities. Andrew: the hosts we are getting definitely have the GPU driver installed. Tests were showing that the driver was present, but failing due to trying to install something they didn't have permission to. We don't have a resolution; the GHA team pointed at the errors but didn't offer a solution. Larry: not a critical blocker, but I would like to have "real" GPU tests as part of OSL CI.
    • Aloys: we used to be able to run OSDview on those CI images. Larry: but OSDview only requires the GL part, whereas OSL needs the CUDA libraries and extensions, since it uses the GPU as a computational engine. So there may still be some missing components. Aloys: I haven't done heavy CUDA work, but have been able to run simple CUDA. Those images are meant to be run as Docker CUDA compute images. Larry: I need to refresh my memory as to what was not working. Aloys: those base images haven't been re-released in a while; I will try looking at that for Rocky Linux since we will need it soon. Will spend a bit more time revisiting; in principle it should work. Please create a GitHub Issue so we can track it. Larry: will do that.
    • Matrixing to support "personal build"?
      • Has anyone had any luck with that? Larry: verified that if you submit a job whose workflow needs one of the paid machines and you trigger it from your forked account, it will just wait forever for a machine that will never become available. It doesn't simply bypass the test.
      • Andrew: will tell GitHub Actions that the proposed solution does not work.
      • Larry: I have a workaround with jobs that run only on the main repo or only on a fork (sketched below, after this item); it's an OK workaround, but cluttered. Not a great solution; it would be better if there were an official way to fall back to the non-paid runners.
    • What project gets what going forward?
      • Bigger CPU runners: OpenVDB, ASWF Docker
      • GPU: OCIO, OSL, OpenVDB for NanoVDB
      • All ASWF projects currently have access; probably best to leave it open, since project maintainers have to approve changes to the CI anyway.
      • JC: what if your workflow is supposed to run on PRs... Andrew: the most likely person to do something like that would be a first-time contributor, and there are checks against that.
      • Christina: how is running on those runners determined? Andrew: by "runs-on" in the workflow; we've documented that in the Slack channel. (See the sketch below.)
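
A minimal sketch of a workflow that targets one of the paid GPU pools via "runs-on" and falls back on forks, per Larry's workaround above. The runner labels are the ones Andrew listed; the org name, job names, and CTest labels are illustrative assumptions, not any project's actual setup.

```yaml
name: gpu-ci
on: [push, pull_request]

jobs:
  gpu-test:
    # Runs only in the upstream repo, where the paid GPU runners are
    # registered; on forks this job is skipped instead of queueing forever.
    if: github.repository_owner == 'AcademySoftwareFoundation'
    runs-on: ubuntu-20.04-gpu-6c-112g-336h-16vr
    steps:
      - uses: actions/checkout@v3
      - name: Run GPU-labelled tests (hypothetical label)
        run: ctest --test-dir build -L gpu

  cpu-fallback:
    # Forks fall back to a standard hosted runner so PR CI still runs.
    if: github.repository_owner != 'AcademySoftwareFoundation'
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - name: Run everything except GPU-labelled tests
        run: ctest --test-dir build -LE gpu
```
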
  • Aswf-docker update (Aloys)
    • VFX Platform 2023 is finalized, OCIO 2.2.0 and OpenVDB 10.0.0 made the deadline
    • Outstanding PRs
    • Aloys: haven't started working on VFX Platform 2023; there are a few outstanding issues I would like to address first:
      • Blosc version update for OpenVDB
      • Broken symlinks inside vfxall due to TBB libraries
      • Some invalid Python 3 shebangs
      • Have local fixes for these
    • Aloys: hoping to start releasing new versions of 2022 this week.
      • Update USD, MaterialX
      • Have a changelog
      • Once these are sent out, this will likely be the last update for 2022. Aloys: Larry, should I update OIIO? Larry: the latest would be great; we had a major release recently. We get LLVM and Clang from Conan, right? So that doesn't depend on being in the container? Aloys: we have Clang 14. Larry: 15 is out; they keep an aggressive release schedule. Aloys: might add that while building; building Clang is quite fast on those new fast machines. Will try to do it before the end of the month to use the free minutes! That's the plan for 2022.
    • For 2023, hoping to leverage the NVIDIA cudagl images; they haven't been released for Rocky Linux yet, but that should happen, and I will check with the developers. They have already released a number of CUDA-based images. There will probably be quite a few images, and it will take a while to rebuild everything. Don't know if anyone is waiting on / trying to build with Rocky Linux, or how many differences there are. Might be nothing, might be lots of issues. It will be an adventure. In the next few weeks.
    • JF: RHEL 8.7 released today, includes gcc 12.
    • Aloys: devtoolset is something I need to "sideload" into the ASWF Docker images. Some investigative work needed. There's a version of gcc specified for the VFX Platform, but no specific devtoolset version.
    • Christina: is Rocky Linux what people are moving to? Rocky, Alma, or RHEL?
    • Aloys: NVIDIA already started publishing cudagl images on Rocky 8, so that's likely what I'll be using.
    • Aloys: any last-minute requests for 2022? Let me know so I can include them in the release. Will check up on OSL.
    • Installing Docker and The Docker Utility Engine for NVIDIA GPUs — NVIDIA AI Enterprise documentation
    • LFX Insight for ASWF Docker project
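
Following the cudagl discussion above, a hedged sketch of exercising the GPU from inside an ASWF container on one of the GPU runners. The image tag is an assumption, and it assumes the runner forwards the container options to `docker run` as GitHub Actions normally does:

```yaml
jobs:
  gpu-container-check:
    runs-on: ubuntu-20.04-gpu-6c-112g-336h-16vr
    container:
      image: aswf/ci-osl:2022   # assumed tag; ASWF images are cudagl-based
      options: --gpus all       # forwarded to `docker run` on the host
    steps:
      - name: Confirm the host driver symlink works inside the container
        run: nvidia-smi
```
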
  • Updates on "small" project hosting and dependency spreadsheet (Larry)
    • The TAC discussion seemed to conclude it is fine for the CI WG to take on small projects. We don't have anything ready to go, but if such things come along that need common infrastructure and this WG feels like the right place, we can take them on; we are the group that handles common infrastructure.
    • Still getting updates to the spreadsheet; some studios have filled it out, still waiting for others. Most ASWF projects have filled out their sections; OCIO had a big release, so they still have to get to it. Lots of interesting information to digest. In some places it shows how many VFX Platform years are still in play; we don't always know what to make of that, since studios are still building internally to support long-running shows. There are a few oddball things; it's not necessarily practical to say that everything "in play" has to be supported. Interesting to see some facilities trying to restrict to a few versions vs. being all over the place. Jeff: as an "all over the place" studio, we don't expect updates to older versions. Larry: we all know what it's like in the trenches. Jeff: everyone wants to move forward! Larry: had a discussion about this on the EXR list, musing about the fact that most projects don't have a lot of resources to maintain ongoing fixes for versions that are years old. Is it better to put effort into making it easier to update without compatibility issues? If "use the latest" is a workable answer, that could be a way forward. A lot of projects have solved this by introducing a namespace that makes the ABI of yearly releases 100% incompatible, trying to make it easy for people who depend on us to update, even if they are stuck on older releases.
  • CII Silver Badge Requirements review (John)
    • Link to document: Analysis of OpenSSF Badge requirements for ASWF projects
    • Larry: pulling up projects to look at.
    • Project must have a passing-level badge
    • Basic project website content: Most of our projects are very good at this, shouldn't be a problem.
    • Project oversight: the project should have a legal mechanism. All our projects use a CLA (?) and DCO.
    • Project MUST clearly define a governance model. Larry: we make this a requirement.
    • Code of conduct: every project has one by default, the one from LF Projects LLC
    • Document key roles, document committers
    • Must be able to continue if a single person is no longer available: don't think this is a problem for any specific project. What if only one person has the keys to the kingdom? LF has backups of everything, and in terms of day-to-day flow, having people from different companies means it's usually not an issue. Andrew: lf-releng is 10-20 people, and GitHub manages secrets, so that part is OK. Are all projects hosted out of GitHub? All project sites are managed out of GitHub Pages, OpenCue uses Netlify, and we own all the DNS domains. Tracked down some stragglers from OpenEXR recently.
    • Documentation: must have a documented roadmap of what the project intends to do and not do for the next year. The roadmap is allowed to change, of course. Larry: most projects may not have one, but they should. It's probably "in their heads". Doesn't have to be detailed.
    • Documentation of the architecture: not sure how many projects have this; probably not all of them? Larry: probably few do, but all should. My projects really should!
    • Project must document what users can expect in terms of security. Might help some projects scope their security work. JF: you may need to already know about security issues in general to define what is in and out of scope. Larry: interpret it as "what support are we willing to give". Some projects have a security README: a promise to address CVEs to the best of our abilities within 30 days, and what we will patch. What users can and can't expect. JC: rez executes Python files; if you use the product, be aware that these are possible issues. Does this cover it? Document what is outside the control of the project; tell users what they can expect in terms of security.
    • Project must provide a quick-start guide: help users do something with the software quickly. Have seen at least a couple of our projects do this.
    • Keep documentation consistent with the current version. Most projects are using something like Read the Docs, building docs in parallel with the code (a minimal config sketch follows this list).
    • Project should have a link to the badge on the GitHub repo and website. JC: the badge is the CII badge? John: yes; there's HTML code you can drop in, which auto-updates based on the CII badge status.
    • Accessibility best practices to be followed. Larry: probably no one knows how to do this?
      • Web applications: only project that has a web app is OpenCue? JC: OTIO doesn't have a webview yet.
      • Most of our projects are libraries or CLIs, so mostly doesn't apply for our projects.
      • Larry: we have "incidental" GUIs, such as developer debugging tools, but not as the primary end-user app. John: the concept is that we want applications to be as accessible to people with disabilities as possible. There are guidelines which might be helpful.
      • Also want to have projects set a standard, thinking about accessibility. But less of an issue for back end projects.
    • Internationalization: if applicable. Mostly around languages. Larry: that's a big requirement; all our software generates error messages, and we don't have translations. JC: for terminal-based applications, doesn't the terminal do this? But you need translations. Andrew: but only if the language files exist. If you use an i18n library for outputting messages, someone else can provide translations. Larry: most of our developers have never had to do something like that. Andrew: that's why it's a SHOULD, not a MUST. Better localization helps adoption. Larry: not arguing that it isn't useful. But most of our projects came from internal projects at companies that are not always large.
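
As referenced under the documentation bullet above, a minimal Read the Docs configuration sketch of the kind many projects use to keep docs building in parallel with the code; the paths and versions are illustrative assumptions:

```yaml
# .readthedocs.yaml: minimal sketch; adjust paths and versions per project
version: 2

build:
  os: ubuntu-20.04
  tools:
    python: "3.9"

sphinx:
  configuration: docs/conf.py   # assumed Sphinx config location

python:
  install:
    - requirements: docs/requirements.txt   # assumed docs requirements file
```
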

Follow Ups

Tools and Links