0058-error-handling.md


Meta

Summary

Defines the error handling semantics of SDK 3.0 and lists the exception or error classes that each SDK MUST implement.

Motivation

  • Consistent handling of errors or exceptions across all SDKs.
  • Classifies and defines the specific errors or exceptions generated by the SDK.
  • Explains how specific errors should be handled by an application.

General Design

Exceptions (errors) are broken into different classifications:

  • Shared Exceptions - exceptions or errors thrown or returned by any service
  • Specific Exceptions - exceptions or errors thrown by a service and specific to the service that threw or returned them.
  • Internal Exceptions - exceptions intended to be handled internally by the SDK and not exposed to the application.

All exceptions derive from a common base Couchbase exception. Furthermore, exceptions can be either internal to the SDK, meaning that they should not be propagated to the application layer, or public and intended to bubble up to the application level.

The following diagram is a conceptual diagram showing the relationships and hierarchy of the exception/error model within the Couchbase SDKs:

figure 1: general design

Base Exception

The base exception should be a class, structure, or similar component which derives from or implements the platform-idiomatic class, structure or object representing an error or exception. The purpose is to readily distinguish an error handled and perhaps thrown by the SDK from a system or platform-idiomatic error/exception. The name must be the word "Couchbase" with either "Error" or "Exception" appended to it, and the type must derive from whatever system-level generic exception or error is available. As always, idiomatic naming and casing should apply to any SDK-specific implementation.

Each CouchbaseException must have two properties:

  • An ErrorContext
  • An optional Cause

class CouchbaseException extends "PlatformException" {
  context: ErrorContext,
  cause: Optional<Exception>
}

The ErrorContext is a marker interface which should be implemented for each service; its contents are implementation specific. The SDK must be able to serialize it into JSON and dump it next to the error/exception when logged. This is important to provide as much context as possible when debugging.

As a rough guideline, here are samples of what the individual implementations could contain:

interface ErrorContext {
    lastDispatchedTo: Optional<String>,
    lastDispatchedFrom: Optional<String>,
    retryAttempts: int,
    retryReasons: Set<RetryReason>
}
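
A minimal sketch of the two structures above, written in Python purely for illustration (the RFC itself is language-neutral). The snake_case field names, the `to_json` helper and the sample values are assumptions, not part of the specification.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class ErrorContext:
    """Generic context; fields mirror the guideline above."""
    last_dispatched_to: Optional[str] = None
    last_dispatched_from: Optional[str] = None
    retry_attempts: int = 0
    retry_reasons: set = field(default_factory=set)

    def to_json(self) -> str:
        data = asdict(self)
        # Sets are not JSON serializable, so emit a sorted list instead.
        data["retry_reasons"] = sorted(data["retry_reasons"])
        return json.dumps(data)


class CouchbaseException(Exception):
    """Base for all SDK errors; derives from the platform's Exception."""

    def __init__(self, message: str, context: ErrorContext,
                 cause: Optional[BaseException] = None):
        super().__init__(message)
        self.context = context
        self.cause = cause

    def __str__(self) -> str:
        # Dump the context next to the message so it shows up in logs.
        return f"{self.args[0]} {self.context.to_json()}"


# Hypothetical values, only to show the logged form.
ctx = ErrorContext(last_dispatched_to="10.0.0.1:11210", retry_attempts=2,
                   retry_reasons={"KV_LOCKED"})
err = CouchbaseException("request failed", ctx)
print(err)
```

Printing the exception yields the message followed by the JSON form of the context, which is the behaviour the logging requirement asks for.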

Service specific context extends the generic Error context with the following attributes (if present):

  • KeyValueErrorContext

    • Document Key
    • Opaque
    • KV Status Code
    • Bucket name
    • Collection name
    • Scope name
    • KV Error Map Info (if present)
      • Name
      • Desc
    • Extended Error Information (if present by server)
      • XError Ref
      • XError Context
  • SubDocumentErrorContext

    • Extends KeyValueErrorContext fields
    • Adds "index" which describes the index of the error
  • QueryErrorContext

    • Query Statement
    • ClientContextId
    • Query Parameters
    • Query Error Http Response Code
    • Query Error Response Body
  • AnalyticsErrorContext

    • Query Statement
    • ClientContextId
    • Analytics Parameters
    • Analytics Error Http Response Code
    • Analytics Error Response Body
  • ViewErrorContext

    • Design Document Name
    • View Name
    • View Query Params
    • View Error Http Response Code
    • View Error Response Body
  • SearchErrorContext

    • Index Name
    • Search Query
    • Search Params
    • Search Error Http Response Code
    • Search Error Response Body
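
One way the service-specific contexts could layer on the generic one, sketched in Python for illustration. Only a few of the listed attributes are shown; the field names, the `present_fields` helper and the reading of "index" as the position of the failing sub-document operation are assumptions.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ErrorContext:
    last_dispatched_to: Optional[str] = None
    retry_attempts: int = 0


@dataclass
class KeyValueErrorContext(ErrorContext):
    document_key: Optional[str] = None
    opaque: Optional[int] = None
    status_code: Optional[int] = None
    bucket_name: Optional[str] = None
    scope_name: Optional[str] = None
    collection_name: Optional[str] = None


@dataclass
class SubDocumentErrorContext(KeyValueErrorContext):
    # Assumed meaning: position of the failing operation in the request.
    index: Optional[int] = None


def present_fields(ctx: ErrorContext) -> dict:
    """Keep only the attributes that are actually present."""
    return {k: v for k, v in asdict(ctx).items() if v is not None}


# Hypothetical values for a failed sub-document lookup.
ctx = SubDocumentErrorContext(document_key="airline_10", status_code=0xC0,
                              bucket_name="travel-sample", index=1)
print(present_fields(ctx))
```

Dropping absent attributes before serialization keeps the logged context compact, in line with the "(if present)" qualifier above.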

If the more specific error contexts above do not apply, it is recommended to fall back to a more generic HTTP error context. As much information as is available and sensible must be included:

  • HTTPErrorContext
    • Method
    • Request Path
    • Response Code
    • Response Body

In addition each context might be able to peek into the additional information defined by the RTO RFC, so that it includes properties like: dispatch time, local socket and remote socket, encoding time, etc.

If an underlying exception has been the cause of the user facing one (this will heavily depend on the platform and actual error), it should be attached as the cause so that nested stack traces are visible in the logs.
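
As an illustration of attaching a cause, Python's `raise ... from` records the underlying error in `__cause__`, so both stack traces appear when the error is logged. The `decode_document` helper is hypothetical; `DecodingFailure` is used here only as a convenient user-facing error.

```python
import json


class CouchbaseException(Exception):
    pass


class DecodingFailure(CouchbaseException):
    pass


def decode_document(raw: bytes) -> dict:
    try:
        return json.loads(raw)
    except ValueError as underlying:
        # Chain the platform error as the cause of the user-facing one.
        raise DecodingFailure("could not decode document") from underlying


try:
    decode_document(b"{not json")
except DecodingFailure as err:
    caught = err
    print(type(err.__cause__).__name__)  # prints "JSONDecodeError"
```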

Shared Exceptions

A shared exception is common across all services and not specific to any one of them. Examples include:

  • TemporaryFailureException
  • TimeoutException
  • RequestCanceledException
  • Etc.

These may or may not be exceptions defined by the underlying platform. They sit on the same level as the service exceptions and derive directly from the CouchbaseException.

Specific Exceptions (and Internal Exceptions)

Specific errors are always returned from Couchbase server itself and are specific to the service which generated them. Specific exceptions may also be Internal Exceptions, meaning that they are handled internally by the SDK and not propagated to the application.

Examples of specific exceptions:

  • DocumentNotFoundException
  • NotMyVBucketException
  • Etc.

The purpose of these errors is to allow for fine-tuned handling of them by either the SDK if they are intended to be an internal error or the application itself.

Aggregate Exception/Error

In certain situations, like bootstrapping, where multiple exceptions might be thrown (one for each server in the bootstrap list until success is reached), an Aggregate Exception/Error may be returned/thrown.

figure 2: Aggregate Exception

Above is an example of the structure that a model of an AggregateException/Error may take; however, please defer to platform-idiomatic design decisions when implementing.

Aggregate exceptions can be platform specific and may not be needed / necessary in every language. If possible, errors should be distilled down to one exception with potentially multiple nested causes to make it easier for the user to figure out what's going on.
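
A rough sketch of distilling several bootstrap failures into a single exception, in Python for illustration. The `bootstrap` function, its `connect` callback and the `failures` attribute are hypothetical; an SDK may prefer its platform's native aggregate type instead.

```python
class CouchbaseException(Exception):
    pass


def bootstrap(nodes, connect):
    """Try each node in turn; return the first successful connection."""
    failures = []
    for node in nodes:
        try:
            return connect(node)
        except Exception as exc:
            failures.append((node, exc))
    # Distill all failures into one exception: the last failure becomes
    # the cause, and the full list stays available for logging.
    summary = ", ".join(f"{node}: {exc}" for node, exc in failures)
    error = CouchbaseException(f"bootstrap failed on all nodes ({summary})")
    error.failures = [exc for _, exc in failures]
    raise error from (failures[-1][1] if failures else None)


def always_refuse(node):
    raise ConnectionRefusedError(f"refused by {node}")


try:
    bootstrap(["node1", "node2"], always_refuse)
except CouchbaseException as err:
    caught = err
    print(len(err.failures))  # prints 2
```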

User Surface Area

The user is only expected to catch/handle the "leaf" nodes of the exception hierarchy and the generic Couchbase exception. The service exception groups will only be raised if no specific exception can be identified (i.e. the error is too generic or unknown); in this case it is still expected to bubble up as a CouchbaseException.

In its simplest form, because the SDK already retries everything it can (per the retry RFC), a user is expected to write code like this:

try {
  some_couchbase_call();
} catch (CouchbaseException x) {
  // log and rethrow
}

If a user then needs to handle certain leaf exceptions directly, they can add additional clauses but still let the rest bubble up:

try {
  collection.get();
} catch (DocumentNotFoundException x) {
  // handle a document not found separately
} catch (CouchbaseException x) {
  // log and rethrow
}
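
The same pattern as a runnable Python sketch, using DocumentNotFoundException (the name the definitions below give to a missing document) as the leaf exception. The `fetch_or_default` helper and the in-memory `fake_get` stand-in are hypothetical and exist only to make the example executable.

```python
class CouchbaseException(Exception):
    pass


class DocumentNotFoundException(CouchbaseException):
    pass


def fetch_or_default(get, key, default=None):
    """Handle the one leaf error we care about; let the rest bubble up."""
    try:
        return get(key)
    except DocumentNotFoundException:
        return default
    except CouchbaseException:
        # Log here if desired, then re-raise unchanged.
        raise


# Stand-in for collection.get(), backed by a dict.
store = {"k1": {"name": "a"}}


def fake_get(key):
    if key not in store:
        raise DocumentNotFoundException(key)
    return store[key]


print(fetch_or_default(fake_get, "missing", default={}))  # prints {}
```

Listing the leaf clause before the base clause matters: the base clause would otherwise swallow every SDK error first.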

Error Definitions

The next sections cover all the possible errors, but do not assign them to commands. Each command in a related RFC needs to refer to the exception type and ID thrown so that there are no "doubly linked lists" that easily get out of date.

Note that for brevity reasons the suffix "Exception" is not used when describing each definition.

A Note on IDs: The IDs in this RFC are only for organisational purposes and MUST NOT be surfaced to the user at this point. Future enhancements w.r.t permalinks from the SDKs can use these numbers as a baseline but additional research needs to be done.

0 CouchbaseException

  • Inherits from System Exception/Error
  • Parent Type for all exceptions below

Shared Error Definitions (ID Range 1 - 99)

1 Timeout

  • Raised When
    • A request cannot be completed before the user-defined timeout fires
  • Notes
    • This is the base class for #13 (AmbiguousTimeout) and #14 (UnambiguousTimeout)

2 RequestCanceled

  • Raised When
    • A request is cancelled and cannot be resolved in a non-ambiguous way. Most likely the request is in-flight on the socket and the socket gets closed.

3 InvalidArgument

  • Raised When
    • It is unambiguously determined that the error was caused because of invalid arguments from the user
    • KV Subdoc:
      • 0xcb
  • Notes
    • Usually only thrown directly when doing request arg validation
    • Also commonly used as a parent class for many service-specific exceptions (see below)

4 ServiceNotAvailable

  • Raised when
    • It can be determined from the config unambiguously that a given service is not available, e.g. there is no query node in the config, or views or N1QL queries are attempted against a memcached bucket

5 InternalServerFailure

  • Raised When
    • Query: Error range 5xxx
    • Analytics: Error range 25xxx
    • KV: error code ERR_INTERNAL (0x84)
    • Search: HTTP 500

6 AuthenticationFailure

  • Raised When
    • Query: Error range 10xxx
    • Analytics: Error range 20xxx
    • View: HTTP status 401
    • KV: error code ERR_ACCESS (0x24), ERR_AUTH_ERROR (0x20), AUTH_STALE (0x1f)
    • Search: HTTP status 401, 403

7 TemporaryFailure

  • Raised When
    • Analytics: Errors: 23000, 23003
    • KV: Error code ERR_TMPFAIL (0x86), ERR_BUSY (0x85), ERR_OUT_OF_MEMORY (0x82), ERR_NOT_INITIALIZED (0x25)

8 ParsingFailure

  • Raised When
    • Query: code 3000
    • Analytics: code 24000

9 CasMismatch

  • Raised When
    • KV: ERR_EXISTS (0x02) when replace or remove with cas
    • Query: code 12009 AND message contains CAS mismatch

10 BucketNotFound

  • Raised When
    • A request is made but the current bucket is not found

11 CollectionNotFound

  • Raised When
    • A request is made but the current collection (including scope) is not found

12 UnsupportedOperation

  • Raised When
    • KV: 0x81 (unknown command), 0x83 (not supported)

13 AmbiguousTimeout

  • Raised When
    • A timeout occurs and we aren't sure if the underlying operation has completed. This normally occurs because we sent the request to the server successfully, but timed out waiting for the response. Note that idempotent operations should never return this, as they do not have ambiguity.

14 UnambiguousTimeout

  • Raised When
    • A timeout occurs and we are confident that the operation could not have succeeded. This normally would occur because we received confident failures from the server, or never managed to successfully dispatch the operation.
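
A simplified sketch of how an SDK might choose between the two timeout subtypes, in Python for illustration. The `timeout_for` function and its two boolean inputs are assumptions; a real implementation would also treat confident failures reported by the server as unambiguous.

```python
class CouchbaseException(Exception):
    pass


class TimeoutException(CouchbaseException):
    pass


class AmbiguousTimeoutException(TimeoutException):
    pass


class UnambiguousTimeoutException(TimeoutException):
    pass


def timeout_for(dispatched: bool, idempotent: bool) -> TimeoutException:
    """Pick the timeout subtype for a request whose deadline fired."""
    # Ambiguity only exists if the request may have reached the server
    # and the operation is not idempotent.
    if dispatched and not idempotent:
        return AmbiguousTimeoutException("operation may have been applied")
    return UnambiguousTimeoutException("no ambiguity about the outcome")


print(type(timeout_for(dispatched=True, idempotent=False)).__name__)
```

Because both subtypes derive from the shared Timeout (#1), callers that do not care about the distinction can catch the parent type alone.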

15 FeatureNotAvailable

  • Raised When
    • A feature which is not available was used.

16 ScopeNotFound

  • Raised When:
    • A management API attempts to target a scope which does not exist.

17 IndexNotFound

  • Raised When
    • Query:
      • Codes 12004, 12016
      • Code 5000 AND message matches the regular expression "index .+ not found"
    • Analytics
      • Raised When 24047
    • Search
      • Http status code 400 AND text contains "index not found"

18 IndexExists

  • Raised When
    • Query
      • Code 5000 AND message matches the regular expression "Index .+ already exist"
      • Code 4300 AND message matches the regular expression "index .+ already exist"
      • Note: the uppercase "Index" for code 5000 is not a mistake. Match only on "exist", not "exists", because a typo somewhere in the query engine means either form may be printed depending on the codepath.
    • Analytics
      • Code 24048
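
The query-side rules for IndexNotFound (#17) and IndexExists (#18) can be sketched as follows, in Python for illustration. The function name and the sample messages in the usage lines are invented; only the codes and patterns come from the definitions above.

```python
import re
from typing import Optional

INDEX_NOT_FOUND = re.compile(r"index .+ not found")
# Uppercase "Index" for code 5000 is intentional; match "exist", not "exists".
INDEX_EXISTS_5000 = re.compile(r"Index .+ already exist")
INDEX_EXISTS_4300 = re.compile(r"index .+ already exist")


def classify_query_index_error(code: int, message: str) -> Optional[str]:
    """Return the shared index error name, or None if no rule applies."""
    if code in (12004, 12016):
        return "IndexNotFound"
    if code == 5000 and INDEX_NOT_FOUND.search(message):
        return "IndexNotFound"
    if code == 5000 and INDEX_EXISTS_5000.search(message):
        return "IndexExists"
    if code == 4300 and INDEX_EXISTS_4300.search(message):
        return "IndexExists"
    return None


# Invented messages, shaped only to exercise the patterns.
print(classify_query_index_error(5000, "Index idx1 already exists."))
print(classify_query_index_error(4300, "The index idx1 already exist."))
```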

19 EncodingFailure

  • Raised when encoding of a user object failed while trying to write it to the cluster

20 DecodingFailure

  • Raised when decoding of the data into the user object failed

21 RateLimited

  • Raised when a service decides that the caller must be rate limited due to exceeding a rate threshold of some sort.
  • Note that since there are many different reasons why a request is rate limited, the error context MUST include the reason / specific type of rate limiting cause for debugging purposes.

Maps to:

  • KeyValue
    • 0x30 RateLimitedNetworkIngress
    • 0x31 RateLimitedNetworkEgress
    • 0x32 RateLimitedMaxConnections
    • 0x33 RateLimitedMaxCommands
  • Cluster Manager (body check tbd)
    • HTTP 429, Body contains "Limit(s) exceeded [num_concurrent_requests]"
    • HTTP 429, Body contains "Limit(s) exceeded [ingress]"
    • HTTP 429, Body contains "Limit(s) exceeded [egress]"
    • Note: when multiple user limits are exceeded the array would contain all the limits exceeded, as "Limit(s) exceeded [num_concurrent_requests,egress]"
  • Query
    • Code 1191, User has more requests running than allowed
    • Code 1192, User has exceeded request rate limit
    • Code 1193, User has exceeded input network traffic limit
    • Code 1194, User has exceeded results size limit
    • Note: use the error code instead of the text to disambiguate the cause.
  • Search
    • HTTP 429, match body contains "num_concurrent_requests"
    • HTTP 429, match body contains "num_queries_per_min"
    • HTTP 429, match body contains "ingress_mib_per_min"
    • HTTP 429, match body contains "egress_mib_per_min"
  • Not applicable to Analytics at the moment
  • Not applicable to views
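
Since the error context must carry the specific rate-limiting reason, an SDK could extract it along these lines. This Python sketch covers the KeyValue codes and the Cluster Manager body format listed above; the function and table names are hypothetical, and the body check is still marked "tbd" in this RFC.

```python
import re

# KeyValue status codes and their reasons, as listed above.
KV_RATE_LIMIT_REASONS = {
    0x30: "RateLimitedNetworkIngress",
    0x31: "RateLimitedNetworkEgress",
    0x32: "RateLimitedMaxConnections",
    0x33: "RateLimitedMaxCommands",
}

LIMITS = re.compile(r"Limit\(s\) exceeded \[([^\]]*)\]")


def exceeded_limits(status: int, body: str) -> list:
    """Return the limit names from a Cluster Manager 429 response body."""
    if status != 429:
        return []
    match = LIMITS.search(body)
    if match is None:
        return []
    # Several limits may be reported at once, separated by commas.
    return [n.strip() for n in match.group(1).split(",") if n.strip()]


print(exceeded_limits(429, "Limit(s) exceeded [num_concurrent_requests,egress]"))
# prints ['num_concurrent_requests', 'egress']
```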

22 QuotaLimited

  • Raised when a service decides that the caller must be limited due to exceeding a quota threshold of some sort.
  • Note that since there are many different reasons why a request is quota limited, the error context MUST include the reason / specific type of quota limiting cause for debugging purposes.

Maps to:

  • KeyValue
    • 0x34 ScopeSizeLimitExceeded
  • Cluster Manager (body check tbd)
    • HTTP 429, Body contains "Maximum number of collections has been reached for scope "<scope_name>""
  • Query
    • Code 5000, Body contains "Limit for number of indexes that can be created per scope has been reached. Limit : value"
  • Search
    • HTTP 400 (Bad request), match body contains "num_fts_indexes"
  • Not applicable to Analytics at the moment
  • Not applicable to views

KeyValue Error Definitions (ID Range 100 - 199)

101 DocumentNotFound

  • Raised When
    • The document requested was not found on the server.
    • KV Code 0x01

102 DocumentUnretrievable

  • Raised When
    • In getAnyReplica, the getAllReplicas returns an empty stream because all the individual errors are dropped (i.e. all returned a DocumentNotFound)

103 DocumentLocked

  • Raised When
    • The document requested was locked.
    • KV Code 0x09

104 ValueTooLarge

  • Raised When
    • The value that was sent was too large to store (typically > 20MB)
    • KV Code 0x03

105 DocumentExists

  • Raised When
    • An operation which relies on the document not existing fails because the document existed.
    • KV Code 0x02

106 {RESERVED}

107 DurabilityLevelNotAvailable

  • Raised When
    • The specified durability level is invalid.
    • KV Code 0xa0

108 DurabilityImpossible

  • Raised When
    • The specified durability requirements are not currently possible (for example, there are an insufficient number of replicas online).
    • KV Code 0xa1

109 DurabilityAmbiguous

  • Raised When
    • A sync-write has not completed in the specified time and has an ambiguous result - it may have succeeded or failed, but the final result is not yet known.
    • A SEQNO OBSERVE operation is performed and the vbucket UUID changes during polling.
    • KV Code 0xa3

110 DurableWriteInProgress

  • Raised When
    • A durable write is attempted against a key which already has a pending durable write.
    • KV Code 0xa2

111 DurableWriteReCommitInProgress

  • Raised When
    • The server is currently working to synchronize all replicas for previously performed durable operations (typically occurs after a rebalance).
    • KV Code 0xa4

112 {RESERVED}

113 PathNotFound

  • Raised When
    • The path provided for a sub-document operation was not found.
    • KV Code 0xc0

114 PathMismatch

  • Raised When
    • The path provided for a sub-document operation did not match the actual structure of the document.
    • KV Code 0xc1

115 PathInvalid

  • Raised When
    • The path provided for a sub-document operation was not syntactically correct.
    • KV Code 0xc2

116 PathTooBig

  • Raised When
    • The path provided for a sub-document operation is too long, or contains too many independent components.
    • KV Code 0xc3

117 PathTooDeep

  • Raised When
    • The document contains too many levels to parse.
    • KV Code 0xc4

118 ValueTooDeep

  • Raised When
    • The value provided, if inserted into the document, would cause the document to become too deep for the server to accept.
    • KV Code 0xca

119 ValueInvalid

  • Raised When
    • The value provided for a sub-document operation would invalidate the JSON structure of the document if inserted as requested.
    • KV Code 0xc5

120 DocumentNotJson

  • Raised When
    • A sub-document operation is performed on a non-JSON document.
    • KV Code 0xc6

121 NumberTooBig

  • Raised When
    • The existing number is outside the valid range for arithmetic operations.
    • KV Code 0xc7

122 DeltaInvalid

  • Raised When
    • The delta value specified for an operation is too large.
    • KV Code 0xc8

123 PathExists

  • Raised When
    • A sub-document operation which relies on a path not existing encountered a path which exists.
    • KV Code 0xc9

124 XattrUnknownMacro

  • Raised When
    • A macro was used which the server did not understand.
    • KV Code: 0xd0

125 {RESERVED}

126 XattrInvalidKeyCombo

  • Raised When
    • A sub-document operation attempts to access multiple xattrs in one operation.
    • KV Code: 0xcf

127 XattrUnknownVirtualAttribute

  • Raised When
    • A sub-document operation attempts to access an unknown virtual attribute.
    • KV Code: 0xd1

128 XattrCannotModifyVirtualAttribute

  • Raised When
    • A sub-document operation attempts to modify a virtual attribute.
    • KV Code: 0xd2

129 {RESERVED}

130 XattrNoAccess

  • Raised When
    • The user does not have permission to access the attribute. Occurs when the user attempts to read or write a system attribute (name starts with underscore) but does not have the SystemXattrRead / SystemXattrWrite permission.
    • KV Code: 0x24
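
A partial lookup from KV status code to error name, sketched in Python for illustration. Two codes are shared between definitions: 0x02 (CasMismatch or DocumentExists) and 0x24 (AuthenticationFailure or XattrNoAccess). The `cas_supplied` and `xattr_access` flags used to tell them apart are assumptions about what the caller knows, not something this RFC prescribes.

```python
# Subset of the codes defined in this document.
KV_STATUS_TO_ERROR = {
    0x01: "DocumentNotFound",
    0x03: "ValueTooLarge",
    0x09: "DocumentLocked",
    0x84: "InternalServerFailure",
    0x86: "TemporaryFailure",
    0xA0: "DurabilityLevelNotAvailable",
    0xA1: "DurabilityImpossible",
    0xA2: "DurableWriteInProgress",
    0xA3: "DurabilityAmbiguous",
    0xA4: "DurableWriteReCommitInProgress",
    0xC0: "PathNotFound",
    0xC1: "PathMismatch",
    0xC2: "PathInvalid",
    0xC9: "PathExists",
    0xCB: "InvalidArgument",
}


def classify_kv_status(status: int, *, cas_supplied: bool = False,
                       xattr_access: bool = False) -> str:
    """Map a KV status code to an error name from this document."""
    if status == 0x02:
        # Same code: meaning depends on whether the request carried a CAS.
        return "CasMismatch" if cas_supplied else "DocumentExists"
    if status == 0x24:
        return "XattrNoAccess" if xattr_access else "AuthenticationFailure"
    # Unknown codes fall back to the generic base exception.
    return KV_STATUS_TO_ERROR.get(status, "CouchbaseException")


print(classify_kv_status(0x02, cas_supplied=True))  # prints CasMismatch
```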

Query Error Definitions (ID Range 200 - 299)

201 PlanningFailure

  • Query
    • Raised When code range 4xxx other than those explicitly covered

202 IndexFailure

  • Query
    • Raised When code range 12xxx and 14xxx (other than 12004 and 12016)

203 PreparedStatementFailure

  • Query
    • Raised When codes 4040, 4050, 4060, 4070, 4080, 4090

204 DmlFailure

  • Query
    • Raised when code 12009 AND message does not contain CAS mismatch
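
The query code rules in this document overlap (specific codes sit inside broader ranges), so evaluation order matters. A Python sketch of one possible ordering follows; the function name is hypothetical, the substring check for "CAS mismatch" is assumed to be case-sensitive, and the message-based index rules for codes 5000 and 4300 are deliberately left out.

```python
PREPARED_STATEMENT_CODES = {4040, 4050, 4060, 4070, 4080, 4090}


def classify_query_error(code: int, message: str = "") -> str:
    """Map a query error code to an error name; specific codes first."""
    if code == 3000:
        return "ParsingFailure"
    if code in PREPARED_STATEMENT_CODES:
        return "PreparedStatementFailure"
    if code == 12009:
        return "CasMismatch" if "CAS mismatch" in message else "DmlFailure"
    if code in (12004, 12016):
        return "IndexNotFound"
    if 1191 <= code <= 1194:
        return "RateLimited"
    # Range rules apply only after the specific codes above.
    thousands = code // 1000
    if thousands == 4:
        return "PlanningFailure"
    if thousands == 5:
        return "InternalServerFailure"
    if thousands == 10:
        return "AuthenticationFailure"
    if thousands in (12, 14):
        return "IndexFailure"
    return "CouchbaseException"


print(classify_query_error(12009, "CAS mismatch"))  # prints CasMismatch
print(classify_query_error(12009, "other failure"))  # prints DmlFailure
```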

Analytics Error Definitions (ID Range 300 - 399)

301 CompilationFailure

  • Raised When
    • Error range 24xxx (excluding the specific codes assigned to other errors)

302 JobQueueFull

  • Raised When
    • Error code 23007

303 DatasetNotFound

  • Raised When
    • Error codes 24044, 24045, 24025

304 DataverseNotFound

  • Raised When
    • Error code 24034

305 DatasetExists

  • Raised When
    • Error code 24040

306 DataverseExists

  • Raised When
    • Error code 24039

307 LinkNotFound

  • Raised When
    • Error code 24006
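
The Analytics codes from this section and from the shared definitions can be combined into one lookup, with exact codes taking precedence over ranges. A Python sketch for illustration; the table and function names are hypothetical.

```python
# Exact codes, collected from the shared and Analytics definitions.
ANALYTICS_CODE_TO_ERROR = {
    23000: "TemporaryFailure",
    23003: "TemporaryFailure",
    23007: "JobQueueFull",
    24000: "ParsingFailure",
    24006: "LinkNotFound",
    24025: "DatasetNotFound",
    24034: "DataverseNotFound",
    24039: "DataverseExists",
    24040: "DatasetExists",
    24044: "DatasetNotFound",
    24045: "DatasetNotFound",
    24047: "IndexNotFound",
    24048: "IndexExists",
}


def classify_analytics_error(code: int) -> str:
    """Exact codes win; otherwise fall back to the documented ranges."""
    if code in ANALYTICS_CODE_TO_ERROR:
        return ANALYTICS_CODE_TO_ERROR[code]
    thousands = code // 1000
    if thousands == 20:
        return "AuthenticationFailure"
    if thousands == 24:
        return "CompilationFailure"
    if thousands == 25:
        return "InternalServerFailure"
    return "CouchbaseException"


print(classify_analytics_error(24040))  # prints DatasetExists
print(classify_analytics_error(24001))  # prints CompilationFailure
```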

Search Error Definitions (ID Range 400 - 499)

  • There are no specific errors for Search; see the Shared Error Definitions section for errors that apply to Search.

View Error Definitions (ID Range 500 - 599)

501 ViewNotFound

  • Raised When
    • Http status code 404
    • Reason or error contains "not_found"

502 DesignDocumentNotFound

  • Raised on the Management APIs only when:
    • Getting a design document
    • Dropping a design document
    • And the server returns 404

Management API Error Definitions (ID Range 600 - 699)

601 CollectionExists

  • Raised from the collection management API

602 ScopeExists

  • Raised from the collection management API

603 UserNotFound

  • Raised from the user management API

604 GroupNotFound

  • Raised from the user management API

605 BucketExists

  • Raised from the bucket management API

606 UserExists

  • Raised from the user management API

607 BucketNotFlushable

  • Raised from the bucket management API

Field-Level Encryption Error Definitions (ID Range 700 - 799)

700 CryptoException

  • Generic cryptography failure.
  • Inherits from CouchbaseException (# 0)
  • Parent Type for all other Field-Level Encryption errors

701 EncryptionFailure

  • Raised by CryptoManager.encrypt() when encryption fails for any reason.
  • Should have one of the other Field-Level Encryption errors as a cause.

702 DecryptionFailure

  • Raised by CryptoManager.decrypt() when decryption fails for any reason.
  • Should have one of the other Field-Level Encryption errors as a cause.

703 CryptoKeyNotFound

  • Raised when a crypto operation fails because a required key is missing.

704 InvalidCryptoKey

  • Raised by an encrypter or decrypter when the key does not meet expectations (for example, if the key is the wrong size).

705 DecrypterNotFound

  • Raised when a message cannot be decrypted because there is no decrypter registered for the algorithm.

706 EncrypterNotFound

  • Raised when a message cannot be encrypted because there is no encrypter registered under the requested alias.

707 InvalidCiphertext

  • Raised when decryption fails due to malformed input, integrity check failure, etc.

SDK-Specific Error Definitions (ID Range 1000 - 2000)

  • This range is reserved for SDK-specific error codes which are not standardized, but might be used later.

Changelog

  • Initial Version #1
  • Nov 6, 2019 - Revision #2 (by Michael Nitschinger)
    • Lots of rework (needs a re-read!)
  • Nov 8, 2019 - Revision #3 (by Michael Nitschinger)
    • Flattened the exception hierarchy
    • Added Error Context
    • Extended sections for each error code
  • Nov 11, 2019 - Revision #4 (by Michael Nitschinger)
    • Change KeyNotFound, KeyExists and KeyLocked to DocumentNotFound, DocumentExists and DocumentLocked, since mixing "id" and "key" is not a good idea
    • Changed InvalidArguments to InvalidArgument
  • Nov 15, 2019 - Revision #5 (by Michael Nitschinger)
    • Added SubdocException and a note which errors extend from it, including the index reference
  • Nov 18, 2019 - Revision #6 (by Michael Nitschinger)
  • Nov 21, 2019 - Revision #7 (by Michael Nitschinger)
    • Added QueryIndexNotFoundException
    • Added DataverseNotFoundException
    • Added code to DatasetNotFoundException
    • Added a couple dataverse, dataset and analytics index exceptions that are all needed in the management APIs
    • Renamed RequestTimeoutException to TimeoutException
    • Added Unambiguous and Ambiguous Timeout Exception
  • Nov 22, 2019 - Revision #8 (by Michael Nitschinger)
    • Added a "not_found" payload case to ViewNotFound (in addition to 404)
  • Dec 6, 2019 - Revision #9 (by Michael Nitschinger)
    • Subdoc DeltaRangeException changed to DeltaInvalidException
    • Added #124 - #129, explicit "explosion" of all xattr exceptions into the list
    • Changed ParsingFailed to ParsingFailure
    • Changed PlanningFailed to PlanningFailure
    • Changed CompilationFailed to CompilationFailure
    • Add Shared ScopeNotFoundException
    • Added section on generic management API definitions which do not fit a specific service bucket
  • Dec 9, 2019 - Revision #10 (by Michael Nitschinger)
    • Added UserNotFound and GroupNotFound to the generic mgmt api section
    • Added BucketAlreadyExists to the generic mgmt api section
  • Dec 11, 2019 - Revision #11 (by Michael Nitschinger)
    • Removed SubDocumentException, Added SubDocumentErrorContext (which holds the additional "index" field)
    • Subdoc CannotInsertValue renamed to ValueInvalid
    • Renamed InternalServer to InternalServerFailure
    • Renamed Authentication to AuthenticationFailure
    • Renamed QueryIndex to generic IndexFailure
    • Consolidate Analytics, Search and Query IndexNotFoundException into a single shared IndexNotFoundException
    • Consolidate AnalyticsIndexExists and QueryIndexExists into single IndexExists*
    • Renamed AnalyticsLinkNotFound to LinkNotFound
    • Added Encoding and DecodingFailure which should cover generically anything that breaks while encoding or decoding user data (both KV and services)
  • Dec 17, 2019 - Revision #12 (by Michael Nitschinger)
    • Changed IndexExists for query regex to match on "...exist" and not "...exists" because of a typo in the query engine
  • Jan 13, 2020 - Revision #13 (by Michael Nitschinger)
    • Renamed PreparedStatement to PreparedStatementFailure (#203)
    • Renamed CollectionAlreadyExists to CollectionExists (#601)
    • Added UserExists (#606)
    • Renamed BucketAlreadyExists to BucketExists (#605)
    • Added BucketNotFlushable (#607)
  • Jan 14, 2020 - Revision #14 (by Michael Nitschinger)
    • Renamed ScopeAlreadyExists to ScopeExists (#602)
    • Renamed BucketAlreadyExists to BucketExists (#605)
  • Jan 15, 2020 - Revision #15 (by Michael Nitschinger)
    • Added Analytics parameters to the analytics error context (and renamed "params" to "parameters")
  • Jan 16, 2020 - Revision #16 (by Michael Nitschinger)
    • Moved EncodingFailure and DecodingFailure to the end of the list and gave it number #19 and #20.
    • Added definition for HTTPErrorContext to cover generic HTTP errors.
    • Fixed a bunch of missing error code "raised when" status values.
    • Removed the MutationLost error and mapped that case under DurabilityAmbiguous instead.
    • Added descriptions to many different error codes.
  • April 30, 2020
    • Moved RFC to ACCEPTED state.
  • May 27, 2020 - Revision #17 (by David Nault)
    • Added Field-Level Encryption errors
  • August 10, 2020 - Revision #18 (by David Nault)
    • Added KV XattrNoAccess

Post-Accepted Changes

  • April 1, 2021 - Revision #19 (by Michael Nitschinger)
    • Added the query DmlFailure and clarified the cas mismatch for query failures.

Signoff

| Language | Representative | Date | Revision |
|---|---|---|---|
| Node.js | Brett Lawson | April 16, 2020 | #16 |
| Go | Charles Dixon | April 22, 2020 | #16 |
| Connectors | David Nault | April 29, 2020 | #16 |
| PHP | Sergey Avseyev | April 22, 2020 | #16 |
| Python | Ellis Breen | April 29, 2020 | #16 |
| Scala | Graham Pople | April 30, 2020 | #16 |
| .NET | Jeffry Morris | April 22, 2020 | #16 |
| Java | Michael Nitschinger | April 16, 2020 | #16 |
| C | Sergey Avseyev | April 22, 2020 | #16 |
| Ruby | Sergey Avseyev | April 22, 2020 | #16 |