view has the wrong signature #4994
Comments
That happens when a view shard is opened but has a view signature that's not current. It comes from
The expected signature matches file path:
The other one is:
I don't recall seeing this error too often. Is there any chance your view shard files were moved, copied, restored from a backup of a much older couch instance, or mounted on a volume shared across multiple nodes? Is it easy to reproduce? Did it happen just once, or is it a regular occurrence?
That's expected as the next action after the log is to reset the view shard and rebuild. So in other words, after the view rebuilds, it should be back to normal. |
I would also ask that question: have the files been moved or renamed outside of couchdb's control? This error is not normal, and is a protection mechanism (effectively an assertion). another possibility is you have multiple nodes pointing at the same shared volume, and trashing each others state. |
I do not think we did any of the above.
This does not happen for 17 minutes and we have recreated the docker container. |
Just checked via AWS console, the CouchDB has "Multi-Attach enabled: no". |
Also, an issue with the |
I am not sure exactly why more CPU would be consumed on the other nodes. It could be that the view shards there hadn't caught up as much as the shard which was reset, so they started to get built. When this view was reset, perhaps there were active requests waiting to receive rows (responses) from it, and so waiting view clients might have been piling up. You can check the number of waiting view clients with: https://docs.couchdb.org/en/stable/api/ddoc/common.html#get--db-_design-ddoc-_info To see if there are any view builds taking place, try using https://docs.couchdb.org/en/stable/api/server/common.html#active-tasks |
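The two checks above could be scripted roughly as follows. This is a minimal sketch, not an official tool: the helper names are hypothetical, it assumes the `_info` response carries a `view_index` object with `signature`, `waiting_clients` and `updater_running`, and that view builds show up in `_active_tasks` with `"type": "indexer"`. Authentication is left out.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Fetch and decode a JSON response (no auth handling in this sketch)."""
    with urlopen(url) as resp:
        return json.load(resp)


def view_index_status(info):
    """Pick the fields of interest out of a GET /{db}/_design/{ddoc}/_info body."""
    view_index = info.get("view_index", {})
    return {
        "signature": view_index.get("signature"),
        "waiting_clients": view_index.get("waiting_clients", 0),
        "updater_running": view_index.get("updater_running", False),
    }


def indexer_tasks(active_tasks, database=None):
    """Keep only view-build tasks from a GET /_active_tasks body.

    On a cluster the task's database is a shard path, so a substring
    match on the database name is used (an assumption of this sketch).
    """
    return [
        task
        for task in active_tasks
        if task.get("type") == "indexer"
        and (database is None or database in task.get("database", ""))
    ]
```

Usage would look like `view_index_status(fetch_json(base + "/mydb/_design/myddoc/_info"))` and `indexer_tasks(fetch_json(base + "/_active_tasks"), "mydb")`, where `base`, `mydb` and `myddoc` are placeholders for your own node URL, database and design doc.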
We have reproduced the same issue on the server without docker. |
It appears that it is possible for the design doc signature to vary according to something we have not yet determined. The client's cluster that we (Neighbourhoodie) are investigating gives the following results for one of the design docs affected by a bad index signature:
We copied this design doc and saved into a local dev cluster running the same CouchDB version/commit, and it gave the same md5sum and signature. However, when putting this same design doc into the CouchDB service installed by Homebrew, it gives the same md5sum but a different signature:
This indicates that it’s possible for the same design document to produce a different signature on different systems, possibly on different nodes of the same cluster depending on what the cause is. Environment of each test:
|
@jcoglan I suspect term_to_binary output is different between architectures or Erlang versions somehow. View signatures are computed in: couchdb/src/couch_mrview/src/couch_mrview_util.erl Lines 280 to 281 in 33bfa13
See if you can add a log statement and dump the Erlang 26 had introduced the The other thing that's changed is UTF8 encoding of atoms. That's probably what the problem is here. See some discussion from last year about it: #4467 (comment) |
couchdb 3.3.3 does not use term_to_bin everywhere, so its output will vary based on whether it runs inside OTP 26 or earlier. |
fixed since 3.3.3 with commit 453c698 |
noting that couchdb 3.3.3 explicitly rejects OTP 26;
|
|
Good find @rnewson it's the On 26
But I wonder if there is something else besides atoms there that would cause nondeterminism. I'd hope not, because that would be terrible. We don't use maps, refs or pids there, but binary reference chunks might sneak in...? |
@nickva term_to_binary is fragile and we should not be using it for view signatures (at least). The original decision was made, I'm sure, out of expediency and/or (benign) ignorance. Forcing {minor_version, 1} is likely a solid fix for years to come, but the only truly safe path out is to define the view signature algorithm explicitly. |
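To illustrate why hashing serializer output is fragile, here is a small sketch that hand-encodes the same atom in two ways and hashes each. The byte layouts are my understanding of the Erlang external term format (version byte 131, `ATOM_EXT` tag 100 with a 2-byte length, `SMALL_ATOM_UTF8_EXT` tag 119 with a 1-byte length); the function names are made up, and this is not how CouchDB computes the real signature, which hashes a much larger term.

```python
import hashlib


def encode_atom_latin1(name):
    """Assumed pre-OTP-26 default: ATOM_EXT (tag 100), 2-byte big-endian length."""
    raw = name.encode("latin-1")
    return bytes([131, 100]) + len(raw).to_bytes(2, "big") + raw


def encode_atom_utf8(name):
    """Assumed OTP 26 default: SMALL_ATOM_UTF8_EXT (tag 119), 1-byte length."""
    raw = name.encode("utf-8")
    if len(raw) > 255:
        raise ValueError("atom too long for the small UTF-8 encoding")
    return bytes([131, 119, len(raw)]) + raw


def signature(encoded):
    """MD5 hex digest of the encoded bytes, standing in for a view signature."""
    return hashlib.md5(encoded).hexdigest()


# Same logical term, two encodings, two different digests.
old_style = signature(encode_atom_latin1("mrst"))
new_style = signature(encode_atom_utf8("mrst"))
```

The atom `mrst` is only an example value. The point is that the digest depends on the byte encoding rather than on the term itself, which is why pinning the encoding (or defining the signature algorithm explicitly) matters.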
@sergey-safarov that's good news. Can it be turned into a script, or are the steps fairly simple to describe? |
For me, it isn't easy to reproduce. It randomly happens. |
I searched through our logs and found a few "has the wrong signature" errors as well. In our case they all happened on nodes that were being decommissioned while database shards were migrating to new nodes. I wonder if there is a higher chance of it happening if there is a network partition or the shard map changes while the design document updates at the same time... |
I seem to have been naughty when making the 3.3.3 Mac binaries, I swear I did this for a good reason, but I don’t recall it at the moment. I think the 25-jit failed on arm Mac, but I’m not sure: https://github.com/janl/build-couchdb-mac/blob/master/build.sh#L72-L73 None of this is relevant to the issue in the ticket, it just clarifies what @jcoglan reported |
When we upgrade empty view files from 2.x, we end up skipping the commit. Subsequently, we will see a "wrong signature" error when the view is opened later. The error is benign as we'd end up resetting an empty view, but it may surprise an operator. To avoid this, ensure to always commit after upgrading old views. Issue: #4994
@jcoglan and I have come up with a repro for the 'wrong signature' event: https://gist.github.com/jcoglan/0a5feb4af2a496ce10c9b80cf02ea28f. Our theory is that when an index is first queried following a v2->v3 upgrade, https://github.com/apache/couchdb/blob/3.3.3/src/couch_mrview/src/couch_mrview_index.erl#L121 will rename the file and return the old signature. however, maybe_update_index_file/1 just renames the file, it does not change its content. https://github.com/apache/couchdb/blob/3.3.3/src/couch_mrview/src/couch_mrview_index.erl#L127 matches and the normal index update path is followed, but if the index is empty then no new content and no new header is written, so the old signature remains in the file. the next time the view is queried, maybe_update_index_file/1 will do nothing (the old file does not exist) and return ok, so we hit this clause where wrong signature appears https://github.com/apache/couchdb/blob/3.3.3/src/couch_mrview/src/couch_mrview_index.erl#L139 |
It just reproduced the issue.
|
jan's theory makes sense to me. the 2.x -> 3.x upgrade code is flawed: it only tries the 2.0 sig if the file was found in the old location, and assumes we'll write a new header for all views on upgrade (any couchdb crash or restart before that would cause things to go wrong subsequently). I suggest a different approach: at https://github.com/apache/couchdb/blob/3.3.3/src/couch_mrview/src/couch_mrview_index.erl#L121 or so, we can try to match the v2 sig if the v3 sig fails. that is, decouple the moving of the file from old to new location from whether the file itself is v2 vs v3. |
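The theory and the suggested fix can be modelled with a toy simulation. This is only a sketch of the reasoning in this thread, not CouchDB code: the dict standing in for the disk, the function `open_index` and the flag `fallback_to_old_sig` are all hypothetical.

```python
def open_index(files, old_path, new_path, old_sig, new_sig,
               has_rows, fallback_to_old_sig=False):
    """Toy model of opening a view index after a 2.x -> 3.x upgrade.

    `files` maps a file path to the signature stored in its header.
    Returns "created", "ok", or "reset" (the wrong-signature case).
    """
    # The upgrade step only moves the file; the header content is untouched.
    renamed = old_path in files
    if renamed:
        files[new_path] = files.pop(old_path)

    header_sig = files.get(new_path)
    if header_sig is None:
        files[new_path] = new_sig
        return "created"

    # Current behaviour in this model: the old signature is accepted only
    # when the rename just happened. The proposed fix accepts it whenever
    # the new one fails.
    accepted = {new_sig}
    if renamed or fallback_to_old_sig:
        accepted.add(old_sig)

    if header_sig not in accepted:
        files[new_path] = new_sig  # index is reset and rebuilt
        return "reset"

    # A new header is only written when the update produced new content.
    if has_rows:
        files[new_path] = new_sig
    return "ok"


def two_opens(fallback):
    """Open an empty upgraded index twice and return both outcomes."""
    files = {"old/view": "sig_v2"}
    args = ("old/view", "new/view", "sig_v2", "sig_v3")
    first = open_index(files, *args, has_rows=False, fallback_to_old_sig=fallback)
    second = open_index(files, *args, has_rows=False, fallback_to_old_sig=fallback)
    return first, second
```

In this model `two_opens(False)` ends in a reset on the second open, because the empty index never got a new header, while `two_opens(True)` accepts the old signature both times. Always committing a new header after the upgrade, as the linked commit message describes, would be another way to avoid the reset.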
Description
On one node (db-2.example.com) we see error logs
This then triggers high CPU usage, and the 3-node CouchDB cluster becomes unresponsive.
The logs contain messages like
fabric_worker_timeout get_db_info
fabric_worker_timeout open_doc
fabric_worker_timeout open_doc
Steps to Reproduce
Not known.
Expected Behaviour
An error in one view should not stop the whole 3-node cluster from functioning.
Your Environment
Additional Context
Used the apache/couchdb:3.3.2 docker container.