Skip to content

workload-learning: Add a workload-learning cache worker#59909

Merged
ti-chi-bot[bot] merged 1 commit intopingcap:masterfrom
elsa0520:learning-tablecost
Mar 17, 2025
Merged

workload-learning: Add a workload-learning cache worker#59909
ti-chi-bot[bot] merged 1 commit intopingcap:masterfrom
elsa0520:learning-tablecost

Conversation

@elsa0520
Copy link
Contributor

@elsa0520 elsa0520 commented Mar 5, 2025

What problem does this PR solve?

Issue Number:ref #58131

Problem Summary:

What changed and how does it work?

  1. Add a new workload-learning cache worker
  2. Implement the read table cost cache logic from workload_values table to memory

The tablecost cache will be called after WorkloadLearningHandle saving the metrics in workload_values table.

				wbLearningHandle.HandleReadTableCost(do.InfoSchema())
				wbCacheWorker.UpdateTableCostCache()

This ensures that the cacheworker can update the latest data in memory neither too early nor too late.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 5, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2025
@elsa0520 elsa0520 marked this pull request as ready for review March 5, 2025 08:04
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 5, 2025
@tiprow
Copy link

tiprow bot commented Mar 5, 2025

Hi @elsa0520. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov
Copy link

codecov bot commented Mar 5, 2025

Codecov Report

Attention: Patch coverage is 56.75676% with 48 lines in your changes missing coverage. Please review.

Project coverage is 74.9009%. Comparing base (d22abc8) to head (8e42e71).
Report is 1 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #59909        +/-   ##
================================================
+ Coverage   73.1725%   74.9009%   +1.7284%     
================================================
  Files          1706       1753        +47     
  Lines        471408     479651      +8243     
================================================
+ Hits         344941     359263     +14322     
+ Misses       105292      97783      -7509     
- Partials      21175      22605      +1430     
Flag Coverage Δ
integration 48.8658% <9.9099%> (?)
unit 72.3259% <56.7567%> (-0.0358%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.6910% <ø> (ø)
parser ∅ <ø> (∅)
br 62.2958% <ø> (+15.0777%) ⬆️
🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@0xPoe 0xPoe moved this to 🧐 Reviewing in 🎒My Work Mar 5, 2025
@0xPoe 0xPoe added this to 🎒My Work Mar 5, 2025
@0xPoe 0xPoe requested review from 0xPoe and Copilot March 5, 2025 09:01
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR introduces a new workload-learning cache worker to refresh and retrieve table cost metrics from the workload_values table in memory. Key changes include:

  • Addition of file pkg/workloadlearning/cache.go implementing the WLCacheWorker and associated caching logic.
  • Integration of the new cache worker into the workload-based learning worker setup in pkg/domain/domain.go.

Reviewed Changes

File Description
pkg/workloadlearning/cache.go Introduces the WLCacheWorker with caching, JSON unmarshalling, and atomic update logic.
pkg/domain/domain.go Integrates the new cache worker into the workload-based learning worker process.

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

pkg/workloadlearning/cache.go:46

  • Consider initializing TableCostMetrics to an empty map (e.g., make(map[int64]*ReadTableCostMetrics)) to avoid potential nil map issues when accessing the cache.
return &WLCacheWorker{pool, &ReadTableCostCache{}, sync.RWMutex{}}

Copy link
Member

@0xPoe 0xPoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

}

// GetTableCostMetrics returns the cached metrics for a given table ID
func (cw *WLCacheWorker) GetTableCostMetrics(tableID int64) *ReadTableCostMetrics {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the priority queue should process all values at once. However, it’s fine to keep the current one and add a new method to retrieve all values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need to process all values at once, just don't forget to fetch and release the RWLock ~

@0xPoe 0xPoe requested review from 0xPoe, Copilot and qw4990 March 12, 2025 09:23
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a new workload-learning cache worker that maintains an in‑memory cache of table cost metrics and integrates it into the workload learning process.

  • Introduces WLCacheWorker in pkg/workloadlearning/cache.go for caching table cost metrics.
  • Updates the workload learning handle and domain worker to use a DestroyableSessionPool and trigger cache updates.
  • Adds unit tests for the cache update logic in pkg/workloadlearning/cache_test.go.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File Description
pkg/workloadlearning/cache.go New cache worker implementation for asynchronously updating table cost metrics.
pkg/workloadlearning/cache_test.go Unit tests to verify the cache update functionality.
pkg/workloadlearning/handle.go Updates to use DestroyableSessionPool and improved session cleanup in metric saving.
pkg/domain/domain.go Integration of the new WLCacheWorker into the domain’s workload learning worker.
Comments suppressed due to low confidence (2)

pkg/workloadlearning/cache.go:46

  • It is recommended to initialize the TableCostMetrics field in ReadTableCostCache (e.g., with make(map[int64]*ReadTableCostMetrics)) to avoid potential nil map issues before the cache is updated.
return &WLCacheWorker{pool, &ReadTableCostCache{}, sync.RWMutex{}}

pkg/workloadlearning/handle.go:140

  • Consider verifying that metrics rows have been appended to the SQL builder before removing the trailing comma; otherwise, if no rows were added, this slicing may inadvertently remove part of the header and lead to a malformed SQL statement.
sql := sql.String()[:sql.Len()-2]

func (cw *WLCacheWorker) GetTableCostMetrics(tableID int64) *ReadTableCostMetrics {
cw.RWMutex.RLock()
defer cw.RWMutex.RUnlock()
metric, exists := cw.readTableCostCache.TableCostMetrics[tableID]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a potential nil panic risk here. We should always initialize the TableCostMetrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolve by make the map

Copy link
Member

@0xPoe 0xPoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

defer func() {
if err == nil { // only recycle when no error
cw.sysSessionPool.Put(se)
} else if err != nil && se != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think se is never nil here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the below code also has this problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double check the code. The session will be nil before the defer function and it will directly return. So I don't need to recheck in defer function.
Has been changed

ORDER BY version DESC LIMIT 1`
rows, _, err := exec.ExecRestrictedSQL(ctx, nil, sql, feedbackCategory, tableCostType)
if err != nil {
logutil.BgLogger().Warn("Failed to get the latest table cost version", zap.Error(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to print the error stack here? You might need to use ErrVerboseLogger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 12, 2025
cw.RWMutex.Lock()
cw.tableReadCostCache.TableReadCostMetrics = newMetrics
cw.tableReadCostCache.Version = latestVersionInStorage
cw.RWMutex.Unlock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to use defer for safety (for example, if the above line panic, then we'll hold this lock forever)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 12, 2025
Copy link
Member

@0xPoe 0xPoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! :shipit:

Comment on lines +50 to +52
return &WLCacheWorker{
pool, cache, sync.RWMutex{}}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return &WLCacheWorker{
pool, cache, sync.RWMutex{}}
}
return &WLCacheWorker{
pool, cache, sync.RWMutex{},
}
}

type ReadTableCostMetrics struct {
// TableReadCostMetrics is used to indicate the intermediate status and results analyzed through table read workload
// for function "HandleTableReadCost".
type TableReadCostMetrics struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to add tags for these fields.

@0xPoe 0xPoe requested a review from Copilot March 12, 2025 14:40
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 12, 2025
@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 12, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-03-12 13:30:29.39357574 +0000 UTC m=+362584.999104667: ☑️ agreed by qw4990.
  • 2025-03-12 14:40:17.02365803 +0000 UTC m=+366772.629186955: ☑️ agreed by Rustin170506.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a workload-learning cache worker and updates the table read cost caching logic while renaming functions and variables for improved clarity.

  • Introduces a new WLCacheWorker in pkg/workloadlearning/cache.go and updates its related tests.
  • Renames ReadTableCostMetrics and related functions to TableReadCostMetrics and HandleTableReadCost for consistency.
  • Adjusts the domain worker setup to integrate both the learning handle and the new cache worker.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/workloadlearning/cache_test.go Adds tests for the new cache worker functionality.
pkg/workloadlearning/cache.go Implements caching logic for table read cost metrics.
pkg/workloadlearning/handle.go Updates workload handle with renaming and batch SQL insertion logic.
pkg/domain/domain.go Integrates the new cache worker into the domain’s worker setup.
pkg/workloadlearning/metrics.go Renames metric types from ReadTableCostMetrics to TableReadCostMetrics.
pkg/workloadlearning/handle_test.go Updates unit tests to reflect the renamed functions and types.
Comments suppressed due to low confidence (2)

pkg/workloadlearning/handle.go:109

  • The function name 'analyzeBasedOnStatementSummary' is inconsistent with the later 'analyzeBasedOnStatementStats'; consider unifying the naming to avoid confusion.
func (*Handle) analyzeBasedOnStatementSummary() []*TableReadCostMetrics {

pkg/workloadlearning/handle.go:141

  • [nitpick] Re-declaring the variable 'sql' here shadows the outer variable; consider using a new variable name (e.g., 'finalSQL') for clarity.
sql := sql.String()[:sql.Len()-2]

Copy link
Contributor

@lance6716 lance6716 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

for domain part

@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lance6716, qw4990, Rustin170506

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Mar 13, 2025
1. Add a new workload-learning cache worker
2. Implement the read table cost cache logic from workload_values table to memory
@elsa0520 elsa0520 force-pushed the learning-tablecost branch from 2bca707 to 8e42e71 Compare March 17, 2025 05:10
@elsa0520
Copy link
Contributor Author

/test all-tests

@ti-chi-bot
Copy link

ti-chi-bot bot commented Mar 17, 2025

@elsa0520: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test build
/test check-dev
/test check-dev2
/test mysql-test
/test pull-br-integration-test
/test pull-integration-ddl-test
/test pull-integration-e2e-test
/test pull-lightning-integration-test
/test pull-mysql-client-test
/test pull-unit-test-ddlv1
/test unit-test

The following commands are available to trigger optional jobs:

/test pingcap/tidb/canary_ghpr_unit_test
/test pull-common-test
/test pull-e2e-test
/test pull-integration-common-test
/test pull-integration-copr-test
/test pull-integration-jdbc-test
/test pull-integration-mysql-test
/test pull-integration-nodejs-test
/test pull-integration-python-orm-test
/test pull-scan-deps
/test pull-sqllogic-test
/test pull-tiflash-test

Use /test all to run the following jobs that were automatically triggered:

pingcap/tidb/ghpr_build
pingcap/tidb/ghpr_check
pingcap/tidb/ghpr_check2
pingcap/tidb/ghpr_mysql_test
pingcap/tidb/ghpr_unit_test
pingcap/tidb/pull_br_integration_test
pingcap/tidb/pull_integration_ddl_test
pingcap/tidb/pull_integration_e2e_test
pingcap/tidb/pull_lightning_integration_test
pingcap/tidb/pull_mysql_client_test
Details

In response to this:

/test all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tiprow
Copy link

tiprow bot commented Mar 17, 2025

@elsa0520: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/test all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tiprow
Copy link

tiprow bot commented Mar 17, 2025

@ti-chi-bot[bot]: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

@elsa0520: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test build
/test check-dev
/test check-dev2
/test mysql-test
/test pull-br-integration-test
/test pull-integration-ddl-test
/test pull-integration-e2e-test
/test pull-lightning-integration-test
/test pull-mysql-client-test
/test pull-unit-test-ddlv1
/test unit-test

The following commands are available to trigger optional jobs:

/test pingcap/tidb/canary_ghpr_unit_test
/test pull-common-test
/test pull-e2e-test
/test pull-integration-common-test
/test pull-integration-copr-test
/test pull-integration-jdbc-test
/test pull-integration-mysql-test
/test pull-integration-nodejs-test
/test pull-integration-python-orm-test
/test pull-scan-deps
/test pull-sqllogic-test
/test pull-tiflash-test

Use /test all to run the following jobs that were automatically triggered:

pingcap/tidb/ghpr_build
pingcap/tidb/ghpr_check
pingcap/tidb/ghpr_check2
pingcap/tidb/ghpr_mysql_test
pingcap/tidb/ghpr_unit_test
pingcap/tidb/pull_br_integration_test
pingcap/tidb/pull_integration_ddl_test
pingcap/tidb/pull_integration_e2e_test
pingcap/tidb/pull_lightning_integration_test
pingcap/tidb/pull_mysql_client_test

In response to this:

/test all-tests

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@elsa0520
Copy link
Contributor Author

/test pull-br-integration-test

@tiprow
Copy link

tiprow bot commented Mar 17, 2025

@elsa0520: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/test pull-br-integration-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot merged commit daa3e33 into pingcap:master Mar 17, 2025
24 checks passed
@0xPoe 0xPoe moved this from 🧐 Reviewing to ✅ Done in 🎒My Work Apr 2, 2025
zeminzhou pushed a commit to zeminzhou/tidb that referenced this pull request May 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants