No squash merge: Update proxy to TiKV(2f3f32d9a0de2ebdb987c00b2419761cfbda4556) #415

Merged

Conversation

CalvinNeo
Member

@CalvinNeo CalvinNeo commented Jan 22, 2025

What is changed and how it works?

Issue Number: Close #xxx

#412

What's Changed:


Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note


SpadeA-Tang and others added 30 commits October 31, 2024 02:12
ref tikv#16141

rearrange parts of metrics panel

Signed-off-by: SpadeA-Tang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#16141

Add a test that simulates inserting 200MB (logical size) of TiDB unique
index and secondary index records and measures SkiplistEngine memory
usage.
Test results:
* For secondary index
  * The key-value encoding amplification is approximately 3.10
  * SkiplistEngine amplification is approximately 7.66
* For unique index
  * The key-value encoding amplification is approximately 3.38
  * SkiplistEngine amplification is approximately 8.19

Signed-off-by: Neil Shen <[email protected]>
…v#17629)

ref tikv#17459

Track the number of locks of large txns in resolver

Signed-off-by: ekexium <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
)

close tikv#12587, fix tikv#16001

To fix the issue where slow region destruction can block snapshot
generation, this PR moves the snapshot generation logic out of the
region worker. A new worker is added to handle snap gen requests but it 
reuses the existing snap generator pool, so the change doesn't 
introduce any new threads.   

This is a simpler approach than the earlier attempt because it doesn't 
deal with the interactions between snapshot apply and destroy. Since 
snapshot generation has always been an independent task handled by its 
own thread pool, this change does not add significant complexity.

Signed-off-by: Bisheng Huang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#15990

Add fsm schedule related metrics

Signed-off-by: Connor <[email protected]>
Signed-off-by: Connor1996 <[email protected]>

Co-authored-by: Bisheng Huang <[email protected]>
close tikv#12371

* switch kms to aws_sdk lib
* switch s3 to aws_sdk lib

Signed-off-by: Andrey Koshchiy <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…kv#17747)

ref tikv#16141

use stop-load-threshold for loading new regions

Signed-off-by: SpadeA-Tang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#17711

Deprecate write_global_seq, since it is false by default.

Signed-off-by: Yang Zhang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…17730)

close tikv#17728

Use min_lock_ts-1 as the candidate of resolved-ts, to ensure resolved_ts < lock.min_commit_ts( <= commit_ts).
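
A minimal sketch of the rule above, assuming timestamps are plain `u64` values. The function name `resolved_ts_candidate` and the `latest_ts` fallback used when no lock is tracked are hypothetical and not taken from the TiKV resolver.

```rust
/// Hypothetical helper: choose a resolved-ts candidate.
/// `min_lock_ts` is the smallest timestamp among tracked locks, if any.
/// `latest_ts` is an assumed fallback for the case with no tracked locks.
fn resolved_ts_candidate(min_lock_ts: Option<u64>, latest_ts: u64) -> u64 {
    match min_lock_ts {
        // Stay strictly below the lock timestamp, so that
        // resolved_ts < min_commit_ts <= commit_ts keeps holding.
        Some(ts) => ts.saturating_sub(1),
        None => latest_ts,
    }
}
```

With a tracked lock at timestamp 100 the candidate is 99, one below the lock, instead of 100.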

Signed-off-by: ekexium <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Co-authored-by: you06 <[email protected]>
ref tikv#16141, close tikv#17762

Derive the default values of the in_memory_engine config options
`evict-threshold` and `stop-load-threshold` from `capacity`.

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…v#17771)

close tikv#17767

IME observes all peer destroy events so it can evict regions in time.
When a new peer is added, the old, uninitialized peer is destroyed, and
IME must not panic in this situation.

Signed-off-by: Neil Shen <[email protected]>
close tikv#17572

Signed-off-by: RidRisR <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#15990

Raft waterfall metrics track the duration of individual requests, all 
beginning from the same starting point (when the async write request is 
scheduled) but ending at various stages of the write process. Previous 
descriptions did not make that clear and may confuse the readers. This 
commit improves the grafana descriptions for clarity.

Signed-off-by: Bisheng Huang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
)

close tikv#17696

* Count CDC tasks against the memory quota to prevent TiKV OOM caused by too many pending tasks
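
The idea of charging pending tasks against a quota can be sketched as follows. This is an illustrative stand-in, not the quota type used by TiKV CDC: the struct `MemoryQuota` and its methods `alloc`, `free`, and `in_use` are assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical byte quota shared by all pending tasks.
struct MemoryQuota {
    capacity: usize,
    in_use: AtomicUsize,
}

impl MemoryQuota {
    fn new(capacity: usize) -> Self {
        MemoryQuota { capacity, in_use: AtomicUsize::new(0) }
    }

    /// Reserves `bytes` for a task. Returns false when the reservation
    /// would exceed the capacity, so the caller can reject or delay the task.
    fn alloc(&self, bytes: usize) -> bool {
        self.in_use
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                used.checked_add(bytes).filter(|n| *n <= self.capacity)
            })
            .is_ok()
    }

    /// Releases `bytes` once the task has been handled.
    fn free(&self, bytes: usize) {
        let _ = self
            .in_use
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                Some(used.saturating_sub(bytes))
            });
    }

    fn in_use(&self) -> usize {
        self.in_use.load(Ordering::SeqCst)
    }
}
```

A producer would call `alloc` before queueing a task and `free` after the task is consumed, so the queue cannot grow past the configured capacity.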

Signed-off-by: Neil Shen <[email protected]>
Signed-off-by: 3AceShowHand <[email protected]>

Co-authored-by: Neil Shen <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…tikv#17643)

close tikv#17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <[email protected]>

Co-authored-by: Bisheng Huang <[email protected]>
…ikv#17765)

close tikv#17383, close tikv#17760

Address a corner case where a read thread panics when reading with a stale index from the `Memtable` in raft-engine, because a background thread has already purged the stale logs and updated the `Memtable`.

Signed-off-by: lucasliang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#17788

Avoid calling `on_gc_finished` when a new GC task is not run because there is another unfinished task.

Signed-off-by: glorv <[email protected]>
…ikv#17515)

ref tikv#16141

This commit adjusts the following in-memory-engine defaults:

* `capacity`: Now IME uses 10% of the block cache and takes an equal
  amount of memory from the system. This is based on tests showing that
  the IME rarely fills its full capacity.
* `mvcc_amplification_threshold`: Changed from 100 to 10, which benefits
  common workloads like TPC-C (50 warehouses), saving approximately 20%
  of unified read pool CPU usage.

Also, it addresses two security issues:

* Remove ignore of RUSTSEC-2024-0006, as vulnerable shlex 0.1.1 is
  removed by tikv#13814
* Upgrade hashbrown from yanked 0.15.0 to 0.15.1

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…d twice (tikv#17798)

close tikv#17797

If the last call to `prepare_for_region` returns `NotInCache`,
`clear_written_regions` can be called twice, in both `write_impl` and
`clear`, which causes a panic. This PR changes `clear_written_regions`
to consume `self.written_regions` to avoid this kind of duplicate clear.
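
A minimal sketch of the "consume the list" pattern, assuming the written regions are tracked as a `Vec<u64>` of region IDs. The `WriteBatch` struct here is a hypothetical stand-in that models only the field relevant to the fix.

```rust
use std::mem;

/// Hypothetical stand-in for the write batch.
struct WriteBatch {
    written_regions: Vec<u64>,
}

impl WriteBatch {
    /// Takes ownership of the recorded regions and leaves an empty list
    /// behind, so a second call finds nothing left to clear.
    fn clear_written_regions(&mut self) -> Vec<u64> {
        mem::take(&mut self.written_regions)
    }
}
```

Because `mem::take` swaps in an empty `Vec`, calling the method from both code paths is harmless: the second call simply returns an empty list.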

Signed-off-by: glorv <[email protected]>
ref tikv#16141

handle error when getting regions info

Signed-off-by: SpadeA-Tang <[email protected]>
close tikv#17701

add write batch limit for raft command batch

Signed-off-by: SpadeA-Tang <[email protected]>
Signed-off-by: SpadeA-Tang <[email protected]>
close tikv#17631

Added a new crate named `compact-log-backup`. It can merge log files generated by log backup and convert them into SSTs.

Signed-off-by: hillium <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#17836

This commit adds metrics to track Raft snapshots that are dropped 
during sending or receiving due to concurrency limits. These metrics 
help identify bottlenecks during scaling.

Signed-off-by: Bisheng Huang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…kv#17625)

close tikv#12410

This PR makes the newly split regions `campaign` in time, once the leadership of the parent region is stable after `on_role_changed`.

Signed-off-by: lucasliang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…17841)

close tikv#17840

Skip handling the remaining raft messages after the peer fsm is stopped. This avoids a potential panic when a raft message needs to read the raft log from raft engine.

Signed-off-by: glorv <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#17852

expr: fix panic when using radians and degree

Signed-off-by: gengliqi <[email protected]>
)

close tikv#17830

Signed-off-by: joccau <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…ax_batch_size. (tikv#17821)

close tikv#17101

Increase the default raft_client_queue_size and raft_msg_max_batch_size.

This PR addresses an issue where too many Raft messages can delay 
sending, increasing the commit log duration and the heartbeat latency. 
The delayed heartbeats can lead to leader drops, especially during PD 
restarts that trigger a surge of hibernated regions. For more details 
on this scenario, see tikv#17101.

We increased the raft_client_queue_size to prevent Raft messages from 
being dropped when the RaftClient queue becomes full under heavy 
message load. Additionally, we increased the raft_msg_max_batch_size 
to improve the efficiency of Raft message sending.

Signed-off-by: hhwyt <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…sts (tikv#17500) (tikv#17870)

close tikv#17394

lock_manager: Skip updating lock wait info for non-fair-locking requests

This is a simpler and lower-risk fix for the OOM issue tikv#17394 on released branches, as an alternative to tikv#17451.
With this change, for acquire_pessimistic_lock requests that do not enable fair locking, update_wait_for becomes a no-op. So if fair locking is globally disabled, the behavior is equivalent to versions before 7.0.

Signed-off-by: MyonKeminta <[email protected]>
hhwyt and others added 26 commits December 17, 2024 09:18
…O consumption (tikv#18006)

ref tikv#15990

This PR fixes improper error handling in calls to
get_thread_io_bytes_total and refactors related code for clarity.

get_thread_io_bytes_total can fail in two scenarios:
1. LOCAL_IO_STATS is not initialized.
2. Errors occur during ThreadId::fetch_io_bytes(), such as failing to
open files.

Previously:
• In io.rs, calls to get_thread_io_bytes_total did not handle the
  second type of error.
• In future.rs, error handling was implemented but not abstracted into
  a reusable utility.

This PR introduces IoBytesTracker, an error-tolerant utility. It starts
calculating incremental I/O bytes only after the first successful
initialization of fetch_io_bytes. Any I/O bytes consumed before
initialization are intentionally ignored.

This approach avoids larger inaccuracies by discarding potentially
unreliable data. Since io_bytes_total is a thread-local cumulative
metric, failures before the start of the statistical logic can result 
in a falsely underestimated initial value, which may lead to 
inaccurate delta calculations.
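
The behavior described above can be sketched roughly as follows. This is a simplified illustration, not the actual `IoBytesTracker`: it tracks a single `u64` counter, and the `update` method taking a `Result<u64, String>` is an assumption made for the example.

```rust
/// Hypothetical, simplified tracker for a cumulative per-thread counter.
#[derive(Default)]
struct IoBytesTracker {
    /// Last successfully fetched cumulative byte count, if any.
    prev: Option<u64>,
}

impl IoBytesTracker {
    /// Feeds one fetch result and returns the bytes consumed since the
    /// previous successful fetch. Returns `None` for a failed fetch and
    /// for the first successful one, which only sets the baseline.
    fn update(&mut self, fetched: Result<u64, String>) -> Option<u64> {
        let total = fetched.ok()?;
        let delta = self.prev.map(|p| total.saturating_sub(p));
        self.prev = Some(total);
        delta
    }
}
```

Bytes consumed before the first successful fetch never show up in a delta, which matches the stated choice to ignore them rather than compute from an unreliable starting value.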

Signed-off-by: hhwyt <[email protected]>
close tikv#18008

If a region is loaded after PrepareMerge (e.g. by hot region loading when
the threshold is set too low) and the region merge is then rolled back, the
target region should either be evicted or updated to the newer epoch version.
This PR chooses to evict on the rollback merge event for simplicity.

Signed-off-by: glorv <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#18005

Do not skip handling raft commands when the peer fsm is stopped. A RaftCommand must always be handled; otherwise it causes a panic.

Signed-off-by: glorv <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#18010

Signed-off-by: Hangjie Mo <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
… reception (tikv#17903)

close tikv#17881

Ensures `recving_count` is decremented before releasing the snapshot 
precheck resource. This prevents a race condition where a new precheck 
succeeds, but the receiver rejects the snapshot because it fails the 
`receiving_busy` check.

Signed-off-by: Bisheng Huang <[email protected]>
…nd`. (tikv#18023)

close tikv#17875

This PR optimizes error handling when resolving a store address from the
response returned by PD.

If the response contains `store xxx not found`, the resolver can directly return
the `StoreTombstone` error so that the raft-client ends its retry loop quickly.
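
As an illustration only, the classification could look like the sketch below. The `ResolveError` enum, the helper names, and the exact message matching are hypothetical; the source states only that a `store xxx not found` response maps to `StoreTombstone`.

```rust
/// Hypothetical error type for address resolution.
#[derive(Debug, PartialEq)]
enum ResolveError {
    /// The store no longer exists; retrying cannot succeed.
    StoreTombstone(u64),
    /// Any other failure, which may be transient.
    Other(String),
}

/// Maps a PD error message to an error kind (assumed message format).
fn classify_pd_error(store_id: u64, message: &str) -> ResolveError {
    let needle = format!("store {} not found", store_id);
    if message.contains(needle.as_str()) {
        ResolveError::StoreTombstone(store_id)
    } else {
        ResolveError::Other(message.to_string())
    }
}

/// The caller's retry loop stops as soon as it sees a tombstone error.
fn should_retry(err: &ResolveError) -> bool {
    !matches!(err, ResolveError::StoreTombstone(_))
}
```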

Signed-off-by: lucasliang <[email protected]>
ref tikv#17939

Polish the logging when periodically updating the disk status.

Signed-off-by: lucasliang <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#17916

concurrency_manager: add safety boundary for max_ts updates

Add `max_ts_limit` to prevent unreasonable timestamp updates. The limit is 
synchronized with PD timestamp periodically. Configure via 
max_ts_allowance_secs and max_ts_sync_interval_secs.

Updates from PD bypass this limit.
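
A simplified model of this boundary check, assuming timestamps are plain `u64` values and the allowance is already expressed in timestamp units. The `MaxTsGuard` type and its methods are hypothetical, as is the choice to apply no limit before the first sync.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical guard around max_ts.
struct MaxTsGuard {
    max_ts: AtomicU64,
    /// Upper bound for updates that do not come from PD.
    max_ts_limit: AtomicU64,
}

impl MaxTsGuard {
    fn new() -> Self {
        MaxTsGuard {
            max_ts: AtomicU64::new(0),
            // Assumption: no limit applies until the first sync with PD.
            max_ts_limit: AtomicU64::new(u64::MAX),
        }
    }

    /// Called periodically with a fresh PD timestamp and the allowance.
    fn sync_limit(&self, pd_ts: u64, allowance: u64) {
        self.max_ts_limit
            .store(pd_ts.saturating_add(allowance), Ordering::SeqCst);
    }

    /// Rejects updates above the limit unless they come from PD.
    fn update_max_ts(&self, new_ts: u64, from_pd: bool) -> Result<(), String> {
        let limit = self.max_ts_limit.load(Ordering::SeqCst);
        if !from_pd && new_ts > limit {
            return Err(format!("max_ts {} exceeds limit {}", new_ts, limit));
        }
        self.max_ts.fetch_max(new_ts, Ordering::SeqCst);
        Ok(())
    }

    fn max_ts(&self) -> u64 {
        self.max_ts.load(Ordering::SeqCst)
    }
}
```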

Signed-off-by: ekexium <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…ead (tikv#18022)

ref tikv#16141

This commit rolls back PR tikv#17927, as it is not the root cause. The rollback allows IME to be used for the following read scenario. 
NOTE: We decided not to cherry-pick it back to v8.5, as there may be other potential issues.

Signed-off-by: glorv <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#18046

Avoid loading a region into IME when it is uninitialized, to prevent a
panic on encoding the region end key. This is because
`MsgPreLoadRegionRequest` is sent before the leader issues a transfer
leader request.

Signed-off-by: Neil Shen <[email protected]>
ref tikv#15990

build: bump tikv pkg version

Signed-off-by: ti-chi-bot <[email protected]>
…by (tikv#18061)

close tikv#18060

Use a regular expression in the panel seriesOverrides to make it compatible with the optional "additional_groupby" alias.

Signed-off-by: glorv <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…f invalid max-ts update (tikv#18057)

close tikv#18055

concurrency_manager: double check via PD TSO before reporting error of invalid max-ts update

Signed-off-by: ekexium <[email protected]>
close tikv#17618

Fix a bug that wrongly truncates the string when the charset is gbk/gb18030

Signed-off-by: cbcwestwolf <[email protected]>
…tered (tikv#18066)

close tikv#18065

Print more information in logs when a default not found error is encountered.

Signed-off-by: cfzjywxk <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#15990

Export the number of currently running background jobs to help diagnose
potential compaction bottlenecks.

Signed-off-by: Neil Shen <[email protected]>

Co-authored-by: Bisheng Huang <[email protected]>
…ction (tikv#18085)

close tikv#18084

`min_input_ts` and `max_input_ts` will be present in a log file compaction.

Signed-off-by: hillium <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ref tikv#15990

Fixed a typo: `Migartion` -> `Migration`.

Signed-off-by: hillium <[email protected]>
ref tikv#18055

When validating max-ts updates, do not report error or panic unless confirmed by PD TSO.
This reduces both false positive and false negative cases.
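
One way to picture the double check is the sketch below. The `MaxTsCheck` enum, the `validate_max_ts` function, and the closure-based PD lookup are hypothetical, and the real comparison may also involve an allowance that is omitted here.

```rust
/// Hypothetical outcome of validating a max-ts update.
#[derive(Debug, PartialEq)]
enum MaxTsCheck {
    Valid,
    /// PD confirmed that the timestamp is too large.
    Invalid,
    /// PD could not be reached, so no error is reported.
    Unconfirmed,
}

fn validate_max_ts<F>(new_ts: u64, cached_limit: u64, fetch_pd_tso: F) -> MaxTsCheck
where
    F: FnOnce() -> Result<u64, String>,
{
    if new_ts <= cached_limit {
        return MaxTsCheck::Valid;
    }
    // The cached limit may be stale, so ask PD before reporting anything.
    match fetch_pd_tso() {
        Ok(pd_ts) if new_ts <= pd_ts => MaxTsCheck::Valid,
        Ok(_) => MaxTsCheck::Invalid,
        Err(_) => MaxTsCheck::Unconfirmed,
    }
}
```

Only the `Invalid` outcome would lead to an error or panic in this model. A stale local limit alone is never enough.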

Signed-off-by: ekexium <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#17894

build: update Dockerfile for build and test

Signed-off-by: wuhuizuo <[email protected]>

Co-authored-by: Ti Chi Robot <[email protected]>
close tikv#18026

Added a new RPC endpoint `flush_now` for the service `LogBackup`.

Signed-off-by: 山岚 <[email protected]>
Signed-off-by: hillium <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#18105, ref pingcap/tidb#58238

Adapt the ignore rules so that the download can skip keys larger than the specified timestamp.

Signed-off-by: 3pointer <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
…tial risk to affect data correctness (tikv#18092)

close tikv#18091

gc_worker: Do not run delete_files_in_range on the lock cf, as it has a potential risk of affecting data correctness

Signed-off-by: MyonKeminta <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
close tikv#17989

If tso fetch fails, skip updating last_pd_tso.

Signed-off-by: ekexium <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: Calvin Neo <[email protected]>

ti-chi-bot bot commented Jan 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from calvinneo, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL label Jan 22, 2025
@CLAassistant

CLAassistant commented Jan 22, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
12 out of 27 committers have signed the CLA.

✅ SpadeA-Tang
✅ ekexium
✅ glorv
✅ joccau
✅ 3AceShowHand
✅ gengliqi
✅ wshwsh12
✅ Defined2014
✅ ti-chi-bot
✅ CbcWestwolf
✅ wuhuizuo
✅ CalvinNeo
❌ overvenus
❌ hbisheng
❌ Connor1996
❌ v01dstar
❌ RidRisR
❌ LykxSassinator
❌ hhwyt
❌ YuJuncen
❌ Tristan1900
❌ MyonKeminta
❌ akoshchiy
❌ hazel1225
❌ cfzjywxk
❌ 3pointer
❌ hicqu
You have signed the CLA already but the status is still pending? Let us recheck it.

@CalvinNeo CalvinNeo merged commit fe68a21 into pingcap:raftstore-proxy-backup-20250122 Jan 22, 2025
1 of 6 checks passed