Commit
…lone

Summary:
As part of the clone workflow, we repartition all the tables of the target database that was created by executing the dump script. This means removing the old tablets and creating new ones during the import snapshot phase. However, we saw cases where the old tablets remain cached in the meta-cache of the tserver that executed the schema creation script. Other tservers can also hold these stale meta-cache entries. For example, as part of executing `CREATE INDEX`, we send `BACKFILL INDEX` queries to the tservers that host the leaders of the base table's tablets, which populates their caches with the old tablets.

The stale meta-cache entries are later used to execute queries that arrive at those tservers. Because the stale tablets were deleted in the import snapshot phase, this leads to the following error:
```
d3=# select count(*) from t2 where age<18;
ERROR: LookupByIdRpc(tablet: 89b4445772d2415aa1702a77031b7d74, num_attempts: 2) failed: Tablet deleted: Not serving tablet deleted upon request at 2024-08-01 15:39:31 UTC
```
Note that we hit this issue only on the first query executed on a tserver with a stale meta-cache. Retrying the same query works, because the failed lookup invalidates the stale entry.

We saw this issue only in colocated databases that have an index. This is because, as part of executing the `CREATE INDEX` command, we ask for the TableLocations of the parent colocated tablet.

The diff fixes the problem by introducing a new tserver RPC `ClearMetaCacheEntriesForNamespace`, which clears all the meta-cache entries (tables and tablets) related to the clone database. This RPC is sent to all tservers as part of the clone workflow. More specifically, clearing the meta-cache happens at the final step of the clone, i.e. after successfully restoring the snapshot on the clone database but before enabling user connections to the database.
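For illustration only, the ordering described above can be sketched with a toy model. This is not the tserver code: `ToyMetaCache`, `remember`, `clear_namespace`, `finish_clone`, and the namespace label `other` are all hypothetical names, and the real meta-cache tracks far more than a tablet-to-namespace map. The sketch only shows the two properties the fix relies on: clearing is scoped to one namespace, and connections are enabled only after every tserver has been cleared.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetaCache:
    """Toy per-tserver cache (hypothetical): tablet_id -> owning namespace."""
    tablets: dict = field(default_factory=dict)

    def remember(self, tablet_id, namespace):
        self.tablets[tablet_id] = namespace

    def clear_namespace(self, namespace):
        """Drop every cached tablet of one namespace; return how many were dropped."""
        stale = [t for t, ns in self.tablets.items() if ns == namespace]
        for tablet_id in stale:
            del self.tablets[tablet_id]
        return len(stale)


def finish_clone(tserver_caches, clone_namespace):
    """Last clone step in this sketch: clear every tserver, then allow connections."""
    for cache in tserver_caches:
        cache.clear_namespace(clone_namespace)
    # Connections are enabled only if no tserver still caches a clone tablet.
    return all(clone_namespace not in c.tablets.values() for c in tserver_caches)


ts1, ts2 = ToyMetaCache(), ToyMetaCache()
# Old tablet cached while running the schema creation script.
ts1.remember("89b4445772d2415aa1702a77031b7d74", "d3")
# Same old tablet cached on another tserver, e.g. while serving a backfill.
ts2.remember("89b4445772d2415aa1702a77031b7d74", "d3")
# Entry from a different namespace (label is an assumption); it must survive.
ts2.remember("0000000000", "other")

connections_enabled = finish_clone([ts1, ts2], "d3")
```

After `finish_clone`, neither toy cache holds a `d3` tablet, while the unrelated `0000000000` entry is still present on `ts2`.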
User connections to the clone database are enabled only after the stale meta-cache entries have been cleared on all tservers.

**Upgrade/Rollback safety**
The diff adds a new RPC `ClearMetacache` that is currently used only in the instant database cloning workflow. The clone feature is protected by the preview flag `enable_db_clone`.

Jira: DB-10520, DB-10522

Test Plan:
./yb_build.sh fastdebug --cxx-test integration-tests_minicluster-snapshot-test --gtest_filter Colocation/PgCloneTestWithColocatedDBParam.CloneAfterDropIndex/1

Also tested manually that `ClearMetacache` clears only the entries that belong to one specific database, using the endpoint `:9000/api/v1/meta-cache`, which shows the set of tablets in the meta-cache. I checked that the tablet `0000000000` is not cleared after executing the RPC, as intended.

Reviewers: asrivastava, mlillibridge

Reviewed By: asrivastava

Subscribers: yguan, ybase, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D37353