Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WMAgent: Properly parse and handle DBS client erros #12229

Open
todor-ivanov opened this issue Jan 15, 2025 · 6 comments
Open

WMAgent: Properly parse and handle DBS client erros #12229

todor-ivanov opened this issue Jan 15, 2025 · 6 comments

Comments

@todor-ivanov
Copy link
Contributor

Impact of the new feature
WMAgent

Is your feature request related to a problem? Please describe.
This issue is a followup on our findings during this investigation: #11965

The list of the so foreseen at the time, development and operational issues was here: #11965 (comment) Out of those, all OPS issues have been already completed and from the DEV issues only the first one has remained, since for the other two:

  • DEV3: Debug and find why are we loosing the HTTP header when we switch to APS frontend - it was due to a bad combination/missing of values for the following flags at the APS fronend: keepAlive && keepAliveTimeout. It would be a task redirected the CMSWeb team

  • DEV2: Debug and fix the bug which caused the blocks overlap only in DBS - We will never know. The Error that caused the blocks scramble between datasets at T0 have never been found, the historical data was not enough.

So this issue is meant to cover only:

  • DEV1: Start properly handling a bigger subset of errors at the Agent returned by the DBS Server

As found and explained here: #11965 (comment), we do not parse the full error returned by the DBS Server and encoded in the HTTP header. But we rather take into consideration only the final error code. A good example was the debugged case in the quoted comment above, and it clearly shows that the dbs client (which is a dependency of ours) properly gets a full error described by the following data structure: [1], which on top of everything also includes the DBS server stacktrace as well. It is clearly seen that many times at the top of the stack or even from the nested errors encoded in the message field at [1] sits some type of ORACLE based error which is later transformed to the proper DBSError Code (101 - missing parent information in this case). So the current issue is twofold:

Describe the solution you'd like

  • Check how we handle the data structure provided by the DBS client
  • Enlarge the set of error codes we handle on our end and include any possible new ones that might concern us and we may have missed in the past.

Describe alternatives you've considered
Do nothing.

Additional context
[1]

[
   {
      "http" : {
         "method" : "POST",
         "x_forwarded_for" : "188.184.96.94:20438, 188.184.96.94",
         "user_agent" : "DBSClient/Unknown/",
         "code" : 400,
         "timestamp" : "2024-11-06 16:16:23.350982889 +0000 UTC m=+5760929.544914892",
         "x_forwarded_host" : "cmsweb-testbed.cern.ch",
         "path" : "/dbs/int/global/DBSWriter/bulkblocks",
         "remote_addr" : "10.100.148.128:41393"
      },
      "message" : "DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g.mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set",
      "exception" : 400,
      "error" : {
         "function" : "dbs.bulkblocks.InsertBulkBlocksConcurrently",
         "reason" : "DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set",
         "message" : "unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root",
         "stacktrace" : "goroutine 7968287 [running]:
github.com/dmwm/dbs2go/dbs.Error({0xb2ca20?, 0xc0006842d0?}, 0x65, {0xc000702000, 0x84}, {0xa5c044, 0x2b})
	/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99
github.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc000236070)
	/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:508 +0x605
github.com/dmwm/dbs2go/web.DBSPostHandler({0xb2f790, 0xc000aa01e0}, 0xc000686c60, {0xa3e07d, 0xa})
	/go/src/github.com/vkuznet/dbs2go/web/handlers.go:562 +0x109e
github.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb2f790?, 0xc000aa01e0?}, 0xc000033f60?)
	/go/src/github.com/vkuznet/dbs2go/web/handlers.go:978 +0x3b
net/http.HandlerFunc.ServeHTTP(0x0?, {0xb2f790?, 0xc000aa01e0?}, 0x11?)
	/usr/local/go/src/net/http/server.go:2171 +0x29
github.com/dmwm/dbs2go/web.limitMiddleware.func1({0xb2f790?, 0xc000aa01e0?}, 0xc0006c6650?)
	/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:110 +0x32
net/http.HandlerFunc.ServeHTTP(0xc0003c0f30?, {0xb2f790?, 0xc000aa01e0?}, 0xc0003af450?)
	/usr/loca",
         "code" : 101
      },
      "type" : "HTTPError"
   }
]

[2]

// DBS Error codes provides static representation of DBS errors, they cover 1xx range
const (
	GenericErrorCode               = iota + 100 // generic DBS error
	DatabaseErrorCode                           // 101 database error
	TransactionErrorCode                        // 102 transaction error
	QueryErrorCode                              // 103 query error
	RowsScanErrorCode                           // 104 row scan error
	SessionErrorCode                            // 105 db session error
	CommitErrorCode                             // 106 db commit error
	ParseErrorCode                              // 107 parser error
	LoadErrorCode                               // 108 loading error, e.g. load template
	GetIDErrorCode                              // 109 get id db error
	InsertErrorCode                             // 110 db insert error
	UpdateErrorCode                             // 111 update error
	LastInsertErrorCode                         // 112 db last insert error
	ValidateErrorCode                           // 113 validation error
	PatternErrorCode                            // 114 pattern error
	DecodeErrorCode                             // 115 decode error
	EncodeErrorCode                             // 116 encode error
	ContentTypeErrorCode                        // 117 content type error
	ParametersErrorCode                         // 118 parameters error
	NotImplementedApiCode                       // 119 not implemented API error
	ReaderErrorCode                             // 120 io reader error
	WriterErrorCode                             // 121 io writer error
	UnmarshalErrorCode                          // 122 json unmarshal error
	MarshalErrorCode                            // 123 marshal error
	HttpRequestErrorCode                        // 124 HTTP request error
	MigrationErrorCode                          // 125 Migration error
	RemoveErrorCode                             // 126 remove error
	InvalidRequestErrorCode                     // 127 invalid request error
	BlockAlreadyExists                          // 128 block xxx already exists in DBS
	FileDataTypesDoesNotExist                   // 129 FileDataTypes does not exist in DBS
	FileParentDoesNotExist                      // 130 FileParent does not exist in DBS
	DatasetParentDoesNotExist                   // 131 DatasetParent does not exist in DBS
	ProcessedDatasetDoesNotExist                // 132 ProcessedDataset does not exist in DBS
	PrimaryDatasetTypeDoesNotExist              // 133 PrimaryDatasetType does not exist in DBS
	PrimaryDatasetDoesNotExist                  // 134 PrimaryDataset does not exist in DBS
	ProcessingEraDoesNotExist                   // 135 ProcessingEra does not exist in DBS
	AcquisitionEraDoesNotExist                  // 136 AcquisitionEra does not exist in DBS
	DataTierDoesNotExist                        // 137 DataTier does not exist in DBS
	PhysicsGroupDoesNotExist                    // 138 PhysicsGroup does not exist in DBS
	DatasetAccessTypeDoesNotExist               // 139 DatasetAccessType does not exist in DBS
	DatasetDoesNotExist                         // 140 Dataset does not exist in DBS
	LastAvailableErrorCode                      // last available DBS error code
)
@anpicci anpicci changed the title WMAgent: Poperly parse and handle DBS client erros WMAgent: Properly parse and handle DBS client erros Jan 15, 2025
@amaltaro
Copy link
Contributor

Thanks for creating this issue, Todor. Here are some comments I have:

Check how we handle the data structure provided by the DBS client

I would rephrase this to the DBS Server, given that the error message is generated and communicated by the DBS Server. The DBS Client is only the pipeline between our client and the server.

Looking at the full error object you dumped above, I see a major inconsistency that makes it hard for the client to digest the actual error, they are:

"error" : { ... "reason" : "DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set",
"error" : { ... "code" : 101

Shouldn't the server provide an error code 103 instead of 101?

Second inconsistency I see comes from:

"error" : { ... "message" : "unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root",

based on this error message, shouldn't the server actually report 130 FileParent does not exist in DBS?

Looking at the DBS Server error codes, I would say we are only concerned about data/constraint related. By a quick look:

  • 128: already dealt with
  • 130, 131: for StepChain case, where we might inject the child dataset before its parent
  • 129, 132-140: which I understand to be caused by the internal concurrency in the server AND to be transient (1 or 2 failures for the same block, then it is supposed to succeed)

Most of the other errors are likely a server-side error and/or are not worth it to have a special handling (as I don't think there is any special action for them). So, recording a generic error message with the code/error/message provided by the server is probably sufficient enough.

@vkuznet
Copy link
Contributor

vkuznet commented Jan 16, 2025

Alan, I would like you to read message string properly, it is consistent with error code as it provides nested error message. Here is the one from Todor's example but only I splitted for your convenice a whole message into its nested parts:

# the message string
      "message" : "DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g.mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set",

# now let's split (parse) it in its parts:
      "message" : 
"DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root 
Error: nested 
DBSError Code:103 Description:DBS DB query error, e.g.mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set",

So, it shows you that DBS error 101 shows it happens in dbs.bulkblocks.InsertBulkBlocksConcurrently function, and provides is own message that it was unable find parent lfn .... Then, it shows the error 103 which comes from dbs.GetID which was called which goes into SQL with no rows in results set. The error code 103 is what at the very bottom happened at DB layer. the error code 101 is what at API level happened.

The DBS server is very good at errors as it provides multiple stacktrace from API level to a down SQL layer. At each layer the appropriate error is shown. The client must properly parse the error message to understand the reason.

Moreover, in the DBS go server errors has very well defined structure they obey, see https://github.com/dmwm/dbs2go/blob/master/dbs/errors.go#L84 In other words it is not arbitrary JSON but a data-struct which is serialized to JSON. This structure will always has reason, message, function, code and stacktrace. It is guaranteed by server code. Therefore any client will be able to rely on this structure to get all fields. How your client read the fields is not DBS business. The message is always shows nested errors which comes from high level API to lower level one. Just parse it!!!

@amaltaro
Copy link
Contributor

I understand that the information is present in the response object, @vkuznet .

However, the client usually expects a clear and concise reason for the error. The server is actually providing tons of information and with somehow redundant and/or confusing data. For instance:

  1. there is a top level message field that encodes everything else (Code, Message, Error - showing twice inside this field!), e.g.:
"message" : "DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g.mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set",

Should I use this top level message field for understanding the actual error? Why isn't it inside the error attribute?

  1. then there is the error top level attribute, with a nested structure:
      "error" : {
         "function" : "dbs.bulkblocks.InsertBulkBlocksConcurrently",
         "reason" : "DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set",
         "message" : "unable to find parent lfn /store/data/Run2024I/ParkingSingleMuon4/RAW/v1/000/386/640/00000/7c1b6c7b-a0bf-4f19-8dfa-1722884306c5.root",
         "stacktrace" : "goroutine 7968287 [running]:
github.com/dmwm/dbs2go/dbs.Error({0xb2ca20?, 0xc0006842d0?}, 0x65, {0xc000702000, 0x84}, {0xa5c044, 0x2b})
	/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99
github.com/dmwm/dbs2go/dbs.(*API).InsertBulkBlocksConcurrently(0xc000236070)
	/go/src/github.com/vkuznet/dbs2go/dbs/bulkblocks2.go:508 +0x605
github.com/dmwm/dbs2go/web.DBSPostHandler({0xb2f790, 0xc000aa01e0}, 0xc000686c60, {0xa3e07d, 0xa})
	/go/src/github.com/vkuznet/dbs2go/web/handlers.go:562 +0x109e
github.com/dmwm/dbs2go/web.BulkBlocksHandler({0xb2f790?, 0xc000aa01e0?}, 0xc000033f60?)
	/go/src/github.com/vkuznet/dbs2go/web/handlers.go:978 +0x3b
net/http.HandlerFunc.ServeHTTP(0x0?, {0xb2f790?, 0xc000aa01e0?}, 0x11?)
	/usr/local/go/src/net/http/server.go:2171 +0x29
github.com/dmwm/dbs2go/web.limitMiddleware.func1({0xb2f790?, 0xc000aa01e0?}, 0xc0006c6650?)
	/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:110 +0x32
net/http.HandlerFunc.ServeHTTP(0xc0003c0f30?, {0xb2f790?, 0xc000aa01e0?}, 0xc0003af450?)
	/usr/loca",
         "code" : 101
      },

And I ask again, what is the meaning of reason? The reason fields point to error code 103, but the code attribute says 101 (confusing!).

Then we also have the message nested attribute, which seems to be finally giving a clear error message (with no code though).

While I appreciate the richness of this error content reported by DBS Server, I feel like there is plenty of DBS expertise that needs to be embedded in order to parse it. Is there a documentation listing each of the fields (nested or not) and explaining what each one of them is? Which ones are relevant? Which field gives me the actual reason for the failure (without cascading failures)?

For this example that we are discussing, the actual error must be unable to find parent lfn, while the other errors are only overcomplicating the actual understanding of the problem. Like "sql: no rows in result set" and "mailformed SQL statement" (which has a bad typo btw) are a consequence of not finding the parent lfn, hence it could be hidden IMO.

@vkuznet
Copy link
Contributor

vkuznet commented Jan 17, 2025

Alan, here are definitions of various attributes in error structure:

  • function provides location of the function within a code, e.g. dbs.bulkblocks.InsertBulkBlocksConcurrently means that it comes from dbs module, bulkclocks.go file, and name of the function is InsertBulkBlocksConcurrently
  • message defines an API action failure, e.g. API was trying to locate parent and fail, this is why it reports message about API action
  • reason defines nested code failure, e.g. the code setup transaction, and executed specific SQL to find a parent. In this example it fails at database layer (but if may fail due to transaction or network issues). The reason capture all actions code was doing. In this case: it was placing a query and fails (this is code 103 which defines QueryErrorCode), it captured the database response no rows in result set (this is what SQL driver returns from ORACLE).
  • code represents error in an API returned to upstream, in this case DatabaseErrorCode
  • stacktrace provides full stack track (python equivalent of Traceback) where error occurs

In this case error happen in https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L508, in particular here is exact location of code block which fails:

        pfid, err := QueryRow("FILES", "file_id", "logical_file_name", plfn)
        if err != nil {
            msg := fmt.Sprintf("unable to find parent lfn %s", plfn)
            return Error(err, DatabaseErrorCode, msg, "dbs.bulkblocks.InsertBulkBlocksConcurrently")
        }

The err comes from QueryRow function which try to get file id from FILES table. It is parsed error structure in form of human string whose code is 103. But the code returned to upstream is generic DatabaseErrorCode (code 101). I understand that it may be confused but it captures all tiny details of what happen at API and lower level (database or potentially other) layers. The client should use code and message as it provides high level error of what happen (in this case it is database error) and provide description as message field. If you need more details why it happen client can look into reason to see why it happens and reason will have its own code, description, etc. to outline these details.

Maybe the better representation would be to use nested errors instead of reason which represents parsed nested error structure. This can be better representation but it will require code changes. If you think that is will improve client readability and parsing then the changes will be required in the following parts:

Otherwise just stick to use code and message to interpret DBS API failures, if you ever need to know more details look into reason which defines why API fails and provides parsed error.

If you want to hide internal details of API failure, then modify relevant parts of DBS code to "hide" these details, e.g. change line https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L508
from being

return Error(err, DatabaseErrorCode, msg, "dbs.bulkblocks.InsertBulkBlocksConcurrently")

to be

return Error(nil, DatabaseErrorCode, msg, "dbs.bulkblocks.InsertBulkBlocksConcurrently")

But hiding information will certainly bytes you when you'll need to understand the reason why API fails to perform its action. Therefore, my suggestion is avoid "hide" approach as much as possible.

@amaltaro
Copy link
Contributor

amaltaro commented Jan 17, 2025

Thank you for this explanation Valentin. In order not to lose this information, I think it would be great to document it somewhere more persistent than a GH issue/PR.

Regarding the level of details and the full chain of errors, I would keep it because it might be useful at some point for some uncommon errors. However, what I believe to be wrong is the high level error reported in this response object. For instance:

The err comes from QueryRow function which try to get file id from FILES table. It is parsed error structure in form of human string whose code is 103. But the code returned to upstream is generic DatabaseErrorCode (code 101).

this looks wrong to me. The generic database error should not have been the error code reported to the user, because this was just a side effect of not having found the parent file. A more appropriate error code would have been 130 instead.
If DBS wraps most of those codes into a generic 101, then the other specific error codes are kind of useless.

Even if we rely on the code and message as you suggested - for a high level understanding of the actual problem, it will not work. They are not consistent, at least for the use case that we are discussing here.

Anyhow, I think I am just repeating myself here...

@vkuznet
Copy link
Contributor

vkuznet commented Jan 17, 2025

I'm glad that it is clear and we found a common ground. I agree that in this case a FileParentDoesNotExist (code 130) is more appropriate and it should be easily fixed. I suggest that Todor make appropriate change at line https://github.com/dmwm/dbs2go/blob/master/dbs/bulkblocks2.go#L508 and may be look at other places for high level error use cases. I also agree that appropriate explanation should be added and Todor can put my #12229 (comment) into docs area where dbs codebase keeps all documentation. Please note that all documentation is kept in markdown data-format and I suggest to keep it this way. Then it can be easily references, e.g. in DBSServer.md.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests

3 participants