WMAgent: Properly parse and handle DBS client errors #12229
Comments
Thanks for creating this issue, Todor. Here are some comments I have:
I would rephrase this to the DBS Server, given that the error message is generated and communicated by the DBS Server. The DBS Client is only the pipeline between our client and the server. Looking at the full error object you dumped above, I see major inconsistencies that make it hard for the client to digest the actual error:
Shouldn't the server provide an error code
The second inconsistency I see comes from:
based on this error message, shouldn't the server actually report
Looking at the DBS Server error codes, I would say we are only concerned about the data/constraint-related ones. By a quick look:
Most of the other errors are likely server-side errors and/or do not warrant special handling (as I don't think there is any special action for them). So, recording a generic error message with the code/error/message provided by the server is probably sufficient.
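The handling proposed above could look roughly like the following client-side sketch. It is only an illustration: the helper `handle_dbs_error`, the `SPECIAL_CODES` set, and the assumption that the error arrives as a dict with `code`, `error` and `message` keys are all hypothetical and not part of WMCore or the DBS client. The code 130 is used only because it is the one mentioned in this thread.

```python
import logging

# Hypothetical: DBS error codes that deserve a dedicated client-side action.
# The real list would have to come from the DBS Server error code definitions.
SPECIAL_CODES = {130}


def handle_dbs_error(err, logger=logging.getLogger(__name__)):
    """Return 'special' for codes with dedicated handling, else 'generic'.

    `err` is assumed to be a dict with optional 'code', 'error' and
    'message' keys (an assumption, not the confirmed server layout).
    """
    code = err.get("code")
    if code in SPECIAL_CODES:
        # Data/constraint-related failure: a dedicated action would go here.
        logger.warning("DBS data/constraint error %s: %s", code, err.get("message"))
        return "special"
    # Everything else: just record what the server reported.
    logger.error("DBS error code=%s error=%s message=%s",
                 code, err.get("error"), err.get("message"))
    return "generic"
```

Keeping the special codes in one set means new cases can be added without touching the generic logging path.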
Alan, I would like you to read
So, it shows you that DBS error 101 shows it happens in
The DBS server is very good at errors, as it provides multiple stacktraces from the API level down to the SQL layer. At each layer the appropriate error is shown. The client must properly parse the error message to understand the reason. Moreover, in the DBS Go server, errors obey a very well-defined structure; see https://github.com/dmwm/dbs2go/blob/master/dbs/errors.go#L84 In other words, it is not arbitrary JSON but a data struct which is serialized to JSON. This structure will always have
I understand that the information is present in the response object, @vkuznet. However, the client usually expects a clear and concise reason for the error. The server is actually providing tons of information, some of it redundant and/or confusing. For instance:
Should I use this top level
And I ask again, what is the meaning of
Then we also have the
While I appreciate the richness of this error content reported by the DBS Server, I feel like there is plenty of DBS expertise that needs to be embedded in order to parse it. Is there documentation listing each of the fields (nested or not) and explaining what each one of them is? Which ones are relevant? Which field gives me the actual reason for the failure (without cascading failures)? For this example that we are discussing, the actual error must be
Alan, here are the definitions of the various attributes in the error structure:
In this case the error happens in https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L508; in particular, here is the exact location of the code block which fails:
The Maybe the better representation would be to use nested errors instead of
Otherwise just stick to use
If you want to hide the internal details of an API failure, then modify the relevant parts of the DBS code to "hide" these details, e.g. change line https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L508
to be
But hiding information will certainly bite you when you need to understand the reason why the API fails to perform its action. Therefore, my suggestion is to avoid the "hide" approach as much as possible.
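If the server did expose nested errors as suggested above, the client could keep the full chain and still pick out the innermost layer as the likely root cause. The sketch below assumes a purely hypothetical layout in which each layer is a dict holding the next, deeper layer under the key `error`; the helper names `error_chain` and `root_cause` are likewise invented for illustration and do not exist in the DBS client.

```python
def error_chain(err, max_depth=50):
    """Yield each error layer from outermost to innermost.

    Assumes (hypothetically) that every layer is a dict which may carry
    the next, deeper layer under the key 'error'.
    """
    depth = 0
    while isinstance(err, dict) and depth < max_depth:  # guard against runaway nesting
        yield err
        err = err.get("error")
        depth += 1


def root_cause(err):
    """Return the innermost layer, or None if `err` is not a dict."""
    layers = list(error_chain(err))
    return layers[-1] if layers else None
```

Nothing is hidden with this approach: the whole chain stays available for debugging, while the caller can base its decision on the innermost code.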
Thank you for this explanation Valentin. In order not to lose this information, I think it would be great to document it somewhere more persistent than a GH issue/PR. Regarding the level of details and the full chain of errors, I would keep it because it might be useful at some point for some uncommon errors. However, what I believe to be wrong is the high level error reported in this response object. For instance:
This looks wrong to me. The generic database error should not have been the error code reported to the user, because this was just a side effect of not having found the parent file. A more appropriate error code would have been 130 instead. Even if we rely on the
Anyhow, I think I am just repeating myself here...
I'm glad that it is clear and we found a common ground. I agree that in this case a
Impact of the new feature
WMAgent
Is your feature request related to a problem? Please describe.
This issue is a followup on our findings during this investigation: #11965
The list of development and operational issues foreseen at the time is here: #11965 (comment). Out of those, all OPS issues have already been completed, and from the DEV issues only the first one remains, since for the other two:
DEV3: Debug and find why we are losing the HTTP header when we switch to the APS frontend - it was due to a bad combination of (or missing) values for the following flags at the APS frontend: keepAlive && keepAliveTimeout. It would be a task redirected to the CMSWeb team.
DEV2: Debug and fix the bug which caused the blocks overlap only in DBS - We will never know. The error that caused the blocks to scramble between datasets at T0 has never been found; the historical data was not enough.
So this issue is meant to cover only:
As found and explained here: #11965 (comment), we do not parse the full error returned by the DBS Server and encoded in the HTTP header; rather, we take into consideration only the final error code. A good example is the case debugged in the quoted comment above: it clearly shows that the DBS client (which is a dependency of ours) properly gets the full error, described by the data structure at [1], which also includes the DBS server stacktrace. Often, at the top of the stack, or in the nested errors encoded in the message field at [1], sits some type of ORACLE-based error which is later transformed into the proper DBSError Code (101 - missing parent information in this case). So the current issue is twofold:
WMCore/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py
Line 108 in 76fd3a9
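As a rough illustration of parsing the full error instead of only the final code, the sketch below decodes a JSON error payload and falls back to the raw text when decoding fails. The helper name `parse_error_payload` is hypothetical, and the choice to accept either a single JSON object or a list of objects is an assumption; how the payload is actually extracted from the HTTP header is not covered here.

```python
import json


def parse_error_payload(raw):
    """Decode a JSON error payload; fall back to the raw text on failure."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON (or not a string at all): keep whatever we were given.
        return {"message": str(raw)}
    # Assumption: the payload is either one object or a list of objects.
    if isinstance(data, list):
        data = data[0] if data else {}
    return data if isinstance(data, dict) else {"message": str(data)}
```

The fallback branch means a malformed or unexpected payload is still recorded rather than silently dropped.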
Describe the solution you'd like
Describe alternatives you've considered
Do nothing.
Additional context
[1]
[2]