-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
report better error messages after Diesel queries #907
Conversation
"Unknown diesel error creating {:?} called {:?}: {:#}", | ||
resource_type, object_name, error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did these lines end up being useful? I think the change is probably good regardless, so I'm just curious if you ran into another opaque error message that this helped resolve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I didn't wind up using these.
nexus/src/db/error.rs
Outdated
error => { | ||
let context = match make_not_found_error() { | ||
PublicError::ObjectNotFound { type_name, lookup_type } => { | ||
format!("looking up {:?} {:?}", type_name, lookup_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a nit, but is it true that this error is only hit when "looking up" a resource? I think your suggestion of a trait that provides an additional context method is really attractive, since it would allow us to provide this string or some other information from the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it depends on what you mean by "lookup". I believe it's always used in a context where we expect a specific object to exist and it doesn't. Sometimes that's because we're trying to update it or delete it, not fetch it. I could make the language more general ("accessing object") to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a qualitative improvement to me, thanks for digging into it! I agree that some trait for providing context would be nice. I'm also curious if we could just attach entire stack traces or something similar to the internal error messages. Those will only show up in logs, so the extra information would probably not be overkill. It might be too noisy though, I'm not sure.
In any case, +1 from me on this change.
I wound up making a pretty small change that eliminates the grossness. There's still room for improvement in terms of the context carried around by errors! |
Crucible: update rust crate base64 to 0.21.3 (#913) update rust crate slog-async to 2.8 (#915) update rust crate async-recursion to 1.0.5 (#912) Move active jobs into a separate data structure and optimize `ackable_work` (#908) Check repair IDs correctly (#910) update actions/checkout action to v4 (#903) update rust crate tokio to 1.32 (#890) Remove single-item Vecs (#898) Move "extent under repair" into a helper function (#907) offset_mod shouldn't be randomized (#905) Only rehash if a write may have failed (#899) Make negotiation state an enum (#901) Test update for fast write ack and gather errors on test failure (#897) Propolis: Update cpuid-gen util for better coverage Make storage backend config more flexible and consistent Use correct register sizes for PIIX3 PM device Update bitflags dependency fix softnpu port order (#517) Use hex formatting for unhandled MMIO/PIO/MSRs Update deps for new crucible and oximeter Update standalone-with-crucible docs (#514)
Crucible: update rust crate base64 to 0.21.3 (#913) update rust crate slog-async to 2.8 (#915) update rust crate async-recursion to 1.0.5 (#912) Move active jobs into a separate data structure and optimize `ackable_work` (#908) Check repair IDs correctly (#910) update actions/checkout action to v4 (#903) update rust crate tokio to 1.32 (#890) Remove single-item Vecs (#898) Move "extent under repair" into a helper function (#907) offset_mod shouldn't be randomized (#905) Only rehash if a write may have failed (#899) Make negotiation state an enum (#901) Test update for fast write ack and gather errors on test failure (#897) Propolis: Update cpuid-gen util for better coverage Make storage backend config more flexible and consistent Use correct register sizes for PIIX3 PM device Update bitflags dependency fix softnpu port order (#517) Use hex formatting for unhandled MMIO/PIO/MSRs Update deps for new crucible and oximeter Update standalone-with-crucible docs (#514) Co-authored-by: Alan Hanson <[email protected]>
This change is a little gross, but it greatly improves error messages that come from Diesel. When we get an unexpected error, it used to look like this:
Now it looks like this:
It's filled in (1) that we're doing a lookup, (2) the kind of resource, (3) how we're doing the lookup (by this specific id), and (4) it's stringified the Diesel error rather than showing the debug output.
I got here because I was looking more closely at a problem @bnaecker mentioned in chat the other day. Say you apply this change to schema.rs, which looks innocuous:
A whole bunch of integration tests seem to hang -- at least, a dozen or so were still running after several minutes. If you run a test individually, it fails due to oxidecomputer/steno#26:
and then hangs there. The only clue Ben and I found was in the log:
Note that that's the error message after the changes in this PR. Before those changes, it looks like this:
As I mentioned, this change is kind of ugly. It's using the "NotFound" error that we would generate if the row wasn't found to figure out what we were trying to do. Ugh.
I'm thinking of adding something to the
ApiResourceError
trait that returns a(ResourceType, LookupType)
tuple instead. I think this is easy to do, since eachauthz
type knows its resource type and stores its lookup_type. This way we don't have to infer from theObjectNotFound
(and write code for the impossible case that it's not anObjectNotFound
error.Another thing I considered: can the lookup API itself produce a better message here? Maybe? But it'd be kind of weird because it uses
public_error_from_diesel_pool()
, which has all this existing logic for picking apart what happened. Then we'd be taking the result of that, picking that apart in a similar way, and if it's an InternalError, we'd be augmenting it with more context. (A useful helper here might be some facility for taking a publicError
and if it's anInternalError
, augmenting the message with more context. Similar to anyhow'scontext()
function. Something like:do_something_that_might_fail.with_internal_context(|| format!("looking up thing with id {}", id))
. This would modify the error only if it was anError::InternalError
. Maybe this is worth looking into?)