CMSSW 15 won't support microarch-v2 #12168
Adding an email from antonio about this:
|
Thanks @mmascher. More discussion is ongoing, as sites are concerned that their people won't be able to analyze Run 3 data with newer CMSSW. The current proposal is at the end of the comment; item (2) is on the WM+SI part. ==========
(1) Building CMSSW with Dual Version Compatibility (v2 + v3):
(2) Central production:
(3) Validation:
|
Hello, as @srimanob mentioned, we discussed this in detail during yesterday's core sw meeting: CMSSW_15_0_X will support dual micro-arch for x86_64, with x86-64-v3 being the default micro-architecture. How can we make sure that Central Production/RelVals use |
@mmascher (there's no hook for Antonio here). Please link to the discussion issue in SI mentioned in the mail from Antonio. Of course support is needed for analysis jobs as well. Now we have multiple threads on this same topic (this one, the one in CRABClient mentioned above, the SI one, and the discussion in CMSSW https://indico.cern.ch/event/1482012/ ). I'd rather have a single document where people comment and which quickly converges to the specification for the solution. |
@belforte , why not just use this issue as a single doc then :-) About "mapping between CMSSW versions and desired micro arch", I think it will be easier if SCRAM sets this via an environment variable. This way you do not need to hardcode this mapping in WM. e.g. in CMSSW_15_0_X scram can set For analysis jobs, where some users might want to use In case these env variables are not set (e.g. for CMSSW_14_2 and earlier) then just do what we are doing now. FYI @makortel |
Assuming that I worry about the overall picture. Of course we should try to make it fully transparent for as many users as possible. Hopefully only those who develop for CMSSW_15 on v2 machines will have to do something. So I have to worry about what other changes to do in CRAB besides the above. my questions are more like
|
Another question to PdmV (@AdrianoDee @DickyChant @miquork): on what sites are the RelVals being run nowadays? In yesterday's core software meeting we had a feeling that the solution on WMCore side to limit production (including RelVal) jobs to x86-64-v3 might not be in place by CMSSW_15_0_0_pre1 (scheduled to be built on December 10), where core software is presently planning to switch the default from v2 to v3 (and where this change would be validated). We assumed PdmV would be able to add the necessary resource matchmaking criteria at the submission time, but then wondered if already the set of sites (e.g. FNAL?) used for RelVals would, in practice, guarantee x86-64-v3-only hardware. |
@makortel currently only at FNAL. We could re-include CERN now that the data taking has ended. But we have no urgency or specific need to do so (FNAL resources are more than enough). |
stupid Q. What about |
Yes. Any exe built for vN should be able to run on vN+x (where x >= 0). So a v2 exe should be able to run on v2, v3 and v4. |
Yeah, today we run v2 binaries on all v2, v3, and v4 hardware. |
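A minimal sketch of the compatibility rule stated in the two comments above (a binary built for vN runs on any host at level vN or higher). The function names are hypothetical and not part of SCRAM, WMCore or CRAB; the accepted spellings `x86-64-vN` and `x86_64-vN` are taken from how the levels are written in this thread.

```python
import re
from typing import Optional

# Accept both spellings used in this thread: "x86-64-v3" and "x86_64-v3".
_LEVEL_RE = re.compile(r"^x86[-_]64-v([1-4])$")


def microarch_level(name: str) -> Optional[int]:
    """Return the integer level (1-4) of an x86-64 microarch name, or None."""
    match = _LEVEL_RE.match(name.strip())
    return int(match.group(1)) if match else None


def can_run(binary_arch: str, host_arch: str) -> bool:
    """True if a binary built for binary_arch can run on a host at host_arch.

    Rule from the discussion: a vN binary runs on vN+x for any x >= 0.
    """
    binary_level = microarch_level(binary_arch)
    host_level = microarch_level(host_arch)
    if binary_level is None or host_level is None:
        raise ValueError("not an x86-64 microarchitecture level")
    return host_level >= binary_level
```

For example, `can_run("x86-64-v2", "x86-64-v4")` is true, while `can_run("x86-64-v3", "x86-64-v2")` is false, which is the case that needs a matchmaking requirement.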
I know that what Matti asked was not a question meant for CRAB. But since the CRAB wrapper sets up the environment for cmsRun with the same tools as WMA does, I looked at the code and confirmed my observation that a developer area is created and then the environment settings from |
Thanks @belforte. So both WMA and CRAB behave similarly in this regard. @smuzaffar Reading your slide 5 again from yesterday, do I understand correctly that in the case that a job lands on a v2-only node and sets up the developer area, that developer area would set up the full multi-arch behavior including the selection of best microarchitecture (v2 in this case) by scram? |
Yes @makortel (though it is not implemented yet), the idea is that if one creates a dev area on a v2-only node then scram should automatically
All cmssw jobs should work fine as long as no part of cmssw (shared libs/plugins) was (re)built in the dev area before submitting the job. So if Central Production/RelVal jobs just create a cmssw dev area and run cmsDriver/cmsRun then it should work fine (regardless of the micro-arch of the node where it runs). The problem I see is for crab jobs where users check out part of cmssw, build, and submit. If someone submits a job with |
Stefano's conclusions are correct
However, the technical details are slightly different. WMCore runs cmssw from WMCore/WMSpec/Steps/Executors/CMSSW.py, and uses Scram.py for the pset tweak and a bash script inside a popen for cmsRun. An example of the arguments is [1]. Long story short, wmcore creates a developer area
then loads into the env the output of
then executes cmsRun
[1]
|
@makortel FYI, Dario provided a reply for your question targeting WMCore |
By the way, any development from WMCore on this topic will likely materialize in the next quarter, since IMU we still lack a common consensus about where to implement the binding to the specific microarchitecture, and our effort is committed to finalizing Q3 dev issues and urgent operation activities. According to the discussion here, I would support @belforte's proposal, and I am not convinced WM is the right place to introduce this binding. |
Thanks @smuzaffar for your summary. Let me rephrase from our (WM) POV to make sure I understood:
|
Hello all, Just for my understanding, is this a use case that will happen very often, or is it something that, in general, is going to change rarely? I see WMAgent and CRAB are defining respectively:
It's very easy to add a job's requirement like (for example):
We can possibly turn this into a "table" with CMSSW versions and the list of supported architectures, using some nested ifs. I would still provision machines solely based on DESIRED_Sites, otherwise Factory ops will need to check all the worker nodes at all sites, and update the static factory xml configuration file with the information about sites. |
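The "table" idea above could look roughly like the following sketch. It only encodes what this thread states: CMSSW 15 and later default to x86-64-v3, and earlier releases are run as v2 today. The function name and the release-name parsing are hypothetical illustrations, not existing WMCore or CRAB code, and treating every pre-15 release as v2 is an assumption.

```python
import re

# Release names in this thread look like "CMSSW_15_0_X" or "CMSSW_15_0_0_pre1".
_RELEASE_RE = re.compile(r"^CMSSW_(\d+)_(\d+)")


def default_min_microarch(release: str) -> int:
    """Map a CMSSW release name to its default minimum x86-64 level.

    Assumption: releases before CMSSW_15 are treated as v2, as they are
    built and run today; CMSSW_15 and later default to v3.
    """
    match = _RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unrecognized CMSSW release name: {release!r}")
    major = int(match.group(1))
    return 3 if major >= 15 else 2
```

A fixed mapping like this only covers the default micro-arch of each release; it cannot express a user who builds CMSSW 15 for v2, which is the limitation raised in the replies.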
thanks @mmascher , your suggestion will work if we only allow to use the default micro-arch of a release. So yes for CMSSW_15_0 and above it can always request resources with But we also need to allow running |
yes @mmascher , as @smuzaffar said CRAB situation is trickier. I hope too that we can avoid using the microarch in the provisioning, but if some jobs have a requirement for v2 and v2 is a scarce resource, it may not work. See GPU e.g. We need to have some understanding of where those v2's are to define policies that will work for us. I'd rather not try to handle this "a priori". |
@belforte , I was thinking the same and I think a scram env variable can help you there. Instead of setting a comma-separated list, scram can just set the minimum required micro-arch |
@mmascher , I see this in the
So how hard it is to add support for We want to enable multi-arch with x86-64-v3 as default for 15.0.0.pre1 (which should be built on 12th Dec). If this can not be implemented before 15.0.0.pre1 then I am fine with the CMSSW version to microarch mapping
At least this will allow us to validate 15.0.0.pre1 for |
@belforte , we can do something similar to ARCH, i.e., set in the job ad:
or
Then add the necessary matchmaking expressions in the machine start expression. |
thanks @mmascher , this will help. |
the problem with that list is not just that it is long and hard to read (well...once coded it will be consumed by machines, when debugging is not needed
? |
thanks @belforte , let me try |
Thanks @smuzaffar for the detailed description in #12168 (comment). I'd suggest to document this behavior somewhere more permanently (e.g. twiki or https://cms-sw.github.io/), and make it more clear what action exactly makes a developer area on |
@mmascher @amaltaro we need to finalize names :-) I propose to use |
Looking into https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py, we already have a mix of REQUIRED and DESIRED job classads. |
Is this name something that in the future could work for ARM or RISC-V microarchitectures, that may have different nomenclature (e.g. not necessarily "versions"), or is this thought to be specific for the x86 case we have at hand now? |
Ideally, yes. I think the only requirement is that we can use it with an integer, such that >, < comparisons are easily applied. |
@makortel you tell us! In case you should also generalize If a version # is not guaranteed, I do not see alternatives to a list. Or a "method" which creates the list on the fly given the "MIN" name. Hard to decide on anything w/o concrete examples. |
A challenge is that we don't know how we would want to deal with multiple ARM or RISC-V microarchitectures (and we don't have any real experience with either, so deciding anything now would be premature). I think this implies that if you want to have a general solution that wouldn't ever have to be changed, we'd need to spend more time thinking about what that would be (and there would be a risk of overengineering). If we proceed with what we know for x86 (which is perfectly fine for me), I'd suggest making it clear that the behavior is specific to x86, and that it is understood that if we ever want/need to deal with multiple ARM/RISC-V microarchitectures the machinery will be reassessed. |
I fully agree. Let's stick with what is simplest now, and reassess when we know better. In an optimistic spirit, let's avoid putting X86 in the name at this time. |
answering my question about "which are v2-(almost-)only sites". I used
|
Hallo Stefano, |
@makortel @smuzaffar about "generalization" this is what drives current CRAB implementation which I am working on:
it will be easy for CRAB to change the algorithm, but I'd rather not change the above logic nor the string format. I.e. if you think that you may like to use @mmascher please confirm if |
Hi @stlammel , sure. It is indeed curious that FNALLPC is v2-only ! Anyhow the only conclusions which I derive from that table are:
In other words, we keep doing things "as now"; jobs which require v3 will simply see some sites as "smaller" and may be queued longer. In the global view of things, T3s are almost only used for user MC and are sort of "bundled together" |
fully agree Stefano! - Stephan |
As mentioned in this comment: I was considering to use it as a single integer, such that comparison/evaluation in glideinWMS can be simple. However, if @mmascher and others think that a construction like |
Very easy for me to set an integer too. Waiting for Marco to say what he wants!
|
Ok, let's go with the integer solution. Here is the constraint, I gave it a test.
It could be added either as a machine requirement in the glideinWMS frontend, or as a job requirement by WMAgent when it writes the JDL. Should I add it to the frontend? |
thanks @mmascher . My preference is that matchmaking requirements are set in gWMS, as done e.g. for DESIRED_Sites. Having requirements in multiple places is confusing. |
Hi @belforte , If we go with REQUIRED_MINIMUM_MICROARCH as an integer I would not set "any" as default. The final constraint I tested in ITB has a protection for REQUIRED_MINIMUM_MICROARCH being undefined. So you can either leave it undefined, or set it to 0 as the default behaviour. Either way all slots will be matched. |
@mmascher one more question. Should we define this job classad only if the cpu architecture is From what I can tell, production jobs could be requesting multiple architectures: will the settings of microarchitecture keep this functionality intact? In other words, a job setting for example:
will it still manage to match against either resource type? |
Hi @amaltaro |
I can add a protection for the machine microarch as well. As far as I know it is not defined for aarch64, so in this case we can allow jobs to run regardless of their REQUIRED_MINIMUM_MICROARCH. Does this sound ok? |
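The matching behaviour discussed in the last few comments can be summarized with the sketch below. It is written in Python purely for illustration and is not the glideinWMS expression that was tested in ITB; `slot_matches` and its parameter names are hypothetical. `job_min` stands for the job's REQUIRED_MINIMUM_MICROARCH, and `machine_level` for the integer microarch level advertised by the slot, which is assumed to be undefined on aarch64.

```python
from typing import Optional


def slot_matches(job_min: Optional[int], machine_level: Optional[int]) -> bool:
    """Decide whether a job may run on a slot, microarch-wise.

    job_min:       the job's REQUIRED_MINIMUM_MICROARCH (None if undefined).
    machine_level: the slot's x86-64 level (None if the slot does not
                   advertise one, e.g. an aarch64 machine).
    """
    # Protection 1: an undefined or 0 requirement matches every slot.
    if not job_min:
        return True
    # Protection 2: a slot without a microarch level is not constrained.
    if machine_level is None:
        return True
    # Otherwise the slot must be at least at the requested level.
    return machine_level >= job_min
```

With these two protections, a job that requests several CPU architectures and sets a minimum level of 3 would still match an aarch64 slot, while being kept off x86-64 slots at level 2.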
Yes Marco, I think this would be ideal for the moment. In other words, the desired behavior I have in mind is:
|
That's implicit in the fact that we specify the microarch with a number. Should we at some point need the same functionality for ARM, or whatever else, then even if it is still a number, it will likely be a different one. |
the agreed behavior has been implemented in CRAB, and deployed on production server, but it is only accessible using |
Impact of the new feature
It was decided at the O&C week that starting from CMSSW 15 we won't build CMSSW for microarch v2. v1 has already been excluded and is not supported anymore. We need to make sure that jobs have a requirement for microarch x86_64-v[34] when they need to run with CMSSW 15.
Additional context
Microarch can be found on the slots as: