workload-learning: Add a workload-learning cache worker by elsa0520 · Pull Request #59909 · pingcap/tidb

elsa0520 · 2025-03-05T08:04:16Z

What problem does this PR solve?

Issue Number: ref #58131

Problem Summary:

What changed and how does it work?

Add a new workload-learning cache worker.
Implement the logic that loads table read costs from the workload_values table into an in-memory cache.

The table cost cache is refreshed right after the WorkloadLearningHandle saves the metrics to the workload_values table:

				wbLearningHandle.HandleReadTableCost(do.InfoSchema())
				wbCacheWorker.UpdateTableCostCache()

Calling the cache worker immediately after the save ensures that it loads the latest data into memory neither too early (before the new metrics are written) nor too late.
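The ordering described above can be illustrated with a minimal sketch. Everything in it (the `store` and `cache` types, `saveMetrics`, `refresh`, `runOnce`) is a hypothetical stand-in, not the code in this PR; it only shows why refreshing right after the save picks up the newest version.

```go
package main

import "fmt"

// store is a hypothetical stand-in for the workload_values table.
type store struct {
	version int
	costs   map[int64]float64
}

// cache is a hypothetical stand-in for the in-memory table cost cache.
type cache struct {
	version int
	costs   map[int64]float64
}

// saveMetrics plays the role of the learning handle: it writes a new version.
func saveMetrics(s *store, costs map[int64]float64) {
	s.version++
	s.costs = costs
}

// refresh plays the role of the cache worker: it copies the latest stored
// version into memory and skips the work when nothing changed.
func refresh(c *cache, s *store) {
	if c.version == s.version {
		return
	}
	fresh := make(map[int64]float64, len(s.costs))
	for id, cost := range s.costs {
		fresh[id] = cost
	}
	c.costs, c.version = fresh, s.version
}

// runOnce performs one tick: save first, then refresh, so the cache never
// reads a version older than the one just written.
func runOnce(s *store, c *cache, costs map[int64]float64) {
	saveMetrics(s, costs)
	refresh(c, s)
}

func main() {
	s, c := &store{}, &cache{costs: map[int64]float64{}}
	runOnce(s, c, map[int64]float64{1: 0.75})
	fmt.Println(c.version, c.costs[1])
}
```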

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot · 2025-03-05T08:04:18Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

tiprow · 2025-03-05T08:04:37Z

Hi @elsa0520. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

codecov · 2025-03-05T08:21:52Z

Codecov Report

Attention: Patch coverage is 56.75676% with 48 lines in your changes missing coverage. Please review.

Project coverage is 74.9009%. Comparing base (d22abc8) to head (8e42e71).
Report is 1 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #59909        +/-   ##
================================================
+ Coverage   73.1725%   74.9009%   +1.7284%     
================================================
  Files          1706       1753        +47     
  Lines        471408     479651      +8243     
================================================
+ Hits         344941     359263     +14322     
+ Misses       105292      97783      -7509     
- Partials      21175      22605      +1430

Flag	Coverage Δ
integration	`48.8658% <9.9099%> (?)`
unit	`72.3259% <56.7567%> (-0.0358%)`	⬇️


Components	Coverage Δ
dumpling	`52.6910% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`62.2958% <ø> (+15.0777%)`	⬆️


Copilot

PR Overview

This PR introduces a new workload-learning cache worker to refresh and retrieve table cost metrics from the workload_values table in memory. Key changes include:

Addition of file pkg/workloadlearning/cache.go implementing the WLCacheWorker and associated caching logic.
Integration of the new cache worker into the workload-based learning worker setup in pkg/domain/domain.go.

Reviewed Changes

File	Description
pkg/workloadlearning/cache.go	Introduces the WLCacheWorker with caching, JSON unmarshalling, and atomic update logic.
pkg/domain/domain.go	Integrates the new cache worker into the workload-based learning worker process.

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

pkg/workloadlearning/cache.go:46

Consider initializing TableCostMetrics to an empty map (e.g., make(map[int64]*ReadTableCostMetrics)) to avoid potential nil map issues when accessing the cache.

return &WLCacheWorker{pool, &ReadTableCostCache{}, sync.RWMutex{}}

pkg/domain/domain.go

pkg/workloadlearning/cache.go

0xPoe

Thanks!

0xPoe · 2025-03-06T08:00:53Z

pkg/workloadlearning/cache.go

+}
+
+// GetTableCostMetrics returns the cached metrics for a given table ID
+func (cw *WLCacheWorker) GetTableCostMetrics(tableID int64) *ReadTableCostMetrics {


I think the priority queue should process all values at once. However, it’s fine to keep the current one and add a new method to retrieve all values.

If you need to process all values at once, just don't forget to acquire and release the RWLock.
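A method that returns all values at once could look roughly like the sketch below. The `costCache` and `tableCost` types and the `snapshot` method are hypothetical names chosen for illustration; the point is only that the copy is taken while the read lock is held, so callers can iterate over the result without holding the lock.

```go
package main

import (
	"fmt"
	"sync"
)

// tableCost is a hypothetical per-table metric.
type tableCost struct{ readCost float64 }

// costCache is a hypothetical cache guarded by an RWMutex.
type costCache struct {
	mu      sync.RWMutex
	byTable map[int64]*tableCost
}

// snapshot returns a shallow copy of every entry, taken under the read lock.
// The lock is released before the caller starts processing the values.
func (c *costCache) snapshot() map[int64]*tableCost {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make(map[int64]*tableCost, len(c.byTable))
	for id, m := range c.byTable {
		out[id] = m
	}
	return out
}

func main() {
	c := &costCache{byTable: map[int64]*tableCost{
		1: {readCost: 0.2},
		2: {readCost: 0.8},
	}}
	all := c.snapshot()
	delete(all, 1) // mutating the copy leaves the cache untouched
	fmt.Println(len(all), len(c.byTable))
}
```

The copy is shallow: the map is independent, but the pointed-to metrics are shared, so this assumes entries are treated as read-only once published.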

pkg/workloadlearning/cache.go

Copilot

Pull Request Overview

This PR adds a new workload-learning cache worker that maintains an in‑memory cache of table cost metrics and integrates it into the workload learning process.

Introduces WLCacheWorker in pkg/workloadlearning/cache.go for caching table cost metrics.
Updates the workload learning handle and domain worker to use a DestroyableSessionPool and trigger cache updates.
Adds unit tests for the cache update logic in pkg/workloadlearning/cache_test.go.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

File	Description
pkg/workloadlearning/cache.go	New cache worker implementation for asynchronously updating table cost metrics.
pkg/workloadlearning/cache_test.go	Unit tests to verify the cache update functionality.
pkg/workloadlearning/handle.go	Updates to use DestroyableSessionPool and improved session cleanup in metric saving.
pkg/domain/domain.go	Integration of the new WLCacheWorker into the domain’s workload learning worker.

Comments suppressed due to low confidence (2)

pkg/workloadlearning/cache.go:46

It is recommended to initialize the TableCostMetrics field in ReadTableCostCache (e.g., with make(map[int64]*ReadTableCostMetrics)) to avoid potential nil map issues before the cache is updated.

return &WLCacheWorker{pool, &ReadTableCostCache{}, sync.RWMutex{}}

pkg/workloadlearning/handle.go:140

Consider verifying that metrics rows have been appended to the SQL builder before removing the trailing comma; otherwise, if no rows were added, this slicing may inadvertently remove part of the header and lead to a malformed SQL statement.

sql := sql.String()[:sql.Len()-2]
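One way to avoid trimming a trailing separator altogether is to return early on empty input and write the separator before every row except the first. The sketch below is an assumption-laden illustration: the table name `t`, the column `id`, and the `buildInsert` helper are hypothetical and are not taken from handle.go.

```go
package main

import (
	"fmt"
	"strings"
)

// buildInsert returns a multi-row INSERT statement for the given IDs.
// It reports ok=false when there are no rows, so the caller never ends up
// slicing characters off the statement header.
func buildInsert(ids []int64) (stmt string, ok bool) {
	if len(ids) == 0 {
		return "", false
	}
	var b strings.Builder
	b.WriteString("INSERT INTO t (id) VALUES ")
	for i, id := range ids {
		if i > 0 {
			b.WriteString(", ") // separator goes before each row after the first
		}
		fmt.Fprintf(&b, "(%d)", id)
	}
	return b.String(), true
}

func main() {
	if stmt, ok := buildInsert([]int64{1, 2}); ok {
		fmt.Println(stmt)
	}
	_, ok := buildInsert(nil)
	fmt.Println("empty input produced a statement:", ok)
}
```

Only integers are formatted into the statement here; values that come from users would normally be passed as parameters instead.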

0xPoe · 2025-03-12T09:29:08Z

pkg/workloadlearning/cache.go

+func (cw *WLCacheWorker) GetTableCostMetrics(tableID int64) *ReadTableCostMetrics {
+	cw.RWMutex.RLock()
+	defer cw.RWMutex.RUnlock()
+	metric, exists := cw.readTableCostCache.TableCostMetrics[tableID]


I think there is a potential nil panic risk here. We should always initialize the TableCostMetrics.

Resolved by initializing the map with make.
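For context on the nil map concern, here is a small sketch of how Go treats nil maps. The type and function names are hypothetical. Reading from a nil map is safe and returns the zero value, while writing to one panics, so initializing the map with make removes the risk for any later write path.

```go
package main

import "fmt"

type metrics struct{ readCost float64 }

// cacheState is a hypothetical struct holding the per-table map.
type cacheState struct {
	byTable map[int64]*metrics
}

// lookup reads from the map; a read on a nil map yields the zero value.
func lookup(c *cacheState, id int64) (*metrics, bool) {
	m, ok := c.byTable[id]
	return m, ok
}

// writePanics stores an entry and reports whether the write panicked.
func writePanics(c *cacheState, id int64) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	c.byTable[id] = &metrics{readCost: 1}
	return false
}

// newCacheState initializes the map up front, as suggested in the review.
func newCacheState() *cacheState {
	return &cacheState{byTable: make(map[int64]*metrics)}
}

func main() {
	var zero cacheState // byTable is nil
	_, ok := lookup(&zero, 1)
	fmt.Println("read from nil map found an entry:", ok)
	fmt.Println("write to nil map panicked:", writePanics(&zero, 1))
	fmt.Println("write to initialized map panicked:", writePanics(newCacheState(), 1))
}
```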

0xPoe

Thanks!

0xPoe · 2025-03-12T09:30:11Z

pkg/workloadlearning/cache.go

+	defer func() {
+		if err == nil { // only recycle when no error
+			cw.sysSessionPool.Put(se)
+		} else if err != nil && se != nil {


I think se is never nil here.

I guess the below code also has this problem.

I double-checked the code. If the session is nil, the function returns directly before the deferred function is registered, so there is no need to recheck it inside the defer.
This has been changed.

0xPoe · 2025-03-12T09:31:21Z

pkg/workloadlearning/cache.go

+            ORDER BY version DESC LIMIT 1`
+	rows, _, err := exec.ExecRestrictedSQL(ctx, nil, sql, feedbackCategory, tableCostType)
+	if err != nil {
+		logutil.BgLogger().Warn("Failed to get the latest table cost version", zap.Error(err))


Do you want to print the error stack here? You might need to use ErrVerboseLogger.

qw4990 · 2025-03-12T13:29:45Z

pkg/workloadlearning/cache.go

+	cw.RWMutex.Lock()
+	cw.tableReadCostCache.TableReadCostMetrics = newMetrics
+	cw.tableReadCostCache.Version = latestVersionInStorage
+	cw.RWMutex.Unlock()


Better to use defer for safety: for example, if one of the lines above panics, we would hold this lock forever.
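A minimal sketch of the defer-based variant is shown below. The `costCache` type, the `replace` method, and the `hook` parameter are hypothetical; the hook exists only to simulate a panic inside the critical section. `TryLock` on `sync.RWMutex` requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// costCache is a hypothetical stand-in for the worker's cache.
type costCache struct {
	mu      sync.RWMutex
	version int64
	byTable map[int64]float64
}

// replace swaps in a new snapshot. The deferred Unlock runs even if a
// statement inside the critical section panics.
func (c *costCache) replace(version int64, fresh map[int64]float64, hook func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if hook != nil {
		hook() // illustrative failure point
	}
	c.byTable = fresh
	c.version = version
}

// replaceRecovering calls replace and reports whether it panicked.
func replaceRecovering(c *costCache, version int64, fresh map[int64]float64, hook func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	c.replace(version, fresh, hook)
	return false
}

func main() {
	c := &costCache{byTable: map[int64]float64{}}
	panicked := replaceRecovering(c, 1, nil, func() { panic("boom") })
	// TryLock succeeds only if the write lock was released despite the panic.
	free := c.mu.TryLock()
	if free {
		c.mu.Unlock()
	}
	fmt.Println(panicked, free)
}
```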

0xPoe

Thanks!

0xPoe · 2025-03-12T14:33:28Z

pkg/workloadlearning/cache.go

+	return &WLCacheWorker{
+		pool, cache, sync.RWMutex{}}
+}


Suggested change

-	return &WLCacheWorker{
-		pool, cache, sync.RWMutex{}}
-}
+	return &WLCacheWorker{
+		pool, cache, sync.RWMutex{},
+	}
+}

0xPoe · 2025-03-12T14:39:55Z

pkg/workloadlearning/metrics.go

-type ReadTableCostMetrics struct {
+// TableReadCostMetrics is used to indicate the intermediate status and results analyzed through table read workload
+// for function "HandleTableReadCost".
+type TableReadCostMetrics struct {


Maybe we need to add tags for these fields.
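As an illustration of what field tags would buy, here is a sketch using `encoding/json`. The struct, its fields, and the tag names are assumptions made for this example and do not reflect the actual fields of TableReadCostMetrics. With tags, the serialized key names are fixed independently of the Go field names, so renaming a field does not change the stored JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tableReadCost is a hypothetical metrics struct; the field and tag names
// are assumptions for illustration only.
type tableReadCost struct {
	TableID  int64   `json:"table_id"`
	ReadCost float64 `json:"read_cost"`
}

// encode serializes the metrics using the tagged key names.
func encode(m tableReadCost) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

// decode parses JSON produced by encode back into the struct.
func decode(s string) (tableReadCost, error) {
	var m tableReadCost
	err := json.Unmarshal([]byte(s), &m)
	return m, err
}

func main() {
	s, err := encode(tableReadCost{TableID: 1, ReadCost: 0.5})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
	m, err := decode(s)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", m)
}
```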

ti-chi-bot · 2025-03-12T14:40:18Z

[LGTM Timeline notifier]

Timeline:

2025-03-12 13:30:29.39357574 +0000 UTC m=+362584.999104667: ☑️ agreed by qw4990.
2025-03-12 14:40:17.02365803 +0000 UTC m=+366772.629186955: ☑️ agreed by Rustin170506.

Copilot

Pull Request Overview

This PR adds a workload-learning cache worker and updates the table read cost caching logic while renaming functions and variables for improved clarity.

Introduces a new WLCacheWorker in pkg/workloadlearning/cache.go and updates its related tests.
Renames ReadTableCostMetrics and related functions to TableReadCostMetrics and HandleTableReadCost for consistency.
Adjusts the domain worker setup to integrate both the learning handle and the new cache worker.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
pkg/workloadlearning/cache_test.go	Adds tests for the new cache worker functionality.
pkg/workloadlearning/cache.go	Implements caching logic for table read cost metrics.
pkg/workloadlearning/handle.go	Updates workload handle with renaming and batch SQL insertion logic.
pkg/domain/domain.go	Integrates the new cache worker into the domain’s worker setup.
pkg/workloadlearning/metrics.go	Renames metric types from ReadTableCostMetrics to TableReadCostMetrics.
pkg/workloadlearning/handle_test.go	Updates unit tests to reflect the renamed functions and types.

Comments suppressed due to low confidence (2)

pkg/workloadlearning/handle.go:109

The function name 'analyzeBasedOnStatementSummary' is inconsistent with the later 'analyzeBasedOnStatementStats'; consider unifying the naming to avoid confusion.

func (*Handle) analyzeBasedOnStatementSummary() []*TableReadCostMetrics {

pkg/workloadlearning/handle.go:141

[nitpick] Re-declaring the variable 'sql' here shadows the outer variable; consider using a new variable name (e.g., 'finalSQL') for clarity.

sql := sql.String()[:sql.Len()-2]

lance6716

/approve

for domain part

ti-chi-bot · 2025-03-13T14:36:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lance6716, qw4990, Rustin170506

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [Rustin170506,lance6716,qw4990]
~~pkg/domain/OWNERS~~ [lance6716]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1. Add a new workload-learning cache worker
2. Implement the read table cost cache logic from workload_values table to memory

elsa0520 · 2025-03-17T06:12:09Z

/test all-tests

ti-chi-bot · 2025-03-17T06:12:12Z

@elsa0520: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test build

/test check-dev

/test check-dev2

/test mysql-test

/test pull-br-integration-test

/test pull-integration-ddl-test

/test pull-integration-e2e-test

/test pull-lightning-integration-test

/test pull-mysql-client-test

/test pull-unit-test-ddlv1

/test unit-test

The following commands are available to trigger optional jobs:

/test pingcap/tidb/canary_ghpr_unit_test

/test pull-common-test

/test pull-e2e-test

/test pull-integration-common-test

/test pull-integration-copr-test

/test pull-integration-jdbc-test

/test pull-integration-mysql-test

/test pull-integration-nodejs-test

/test pull-integration-python-orm-test

/test pull-scan-deps

/test pull-sqllogic-test

/test pull-tiflash-test

Use /test all to run the following jobs that were automatically triggered:

pingcap/tidb/ghpr_build

pingcap/tidb/ghpr_check

pingcap/tidb/ghpr_check2

pingcap/tidb/ghpr_mysql_test

pingcap/tidb/ghpr_unit_test

pingcap/tidb/pull_br_integration_test

pingcap/tidb/pull_integration_ddl_test

pingcap/tidb/pull_integration_e2e_test

pingcap/tidb/pull_lightning_integration_test

pingcap/tidb/pull_mysql_client_test

tiprow · 2025-03-17T06:12:38Z

@elsa0520: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

elsa0520 · 2025-03-17T06:12:42Z

/test pull-br-integration-test

tiprow · 2025-03-17T06:13:03Z

@elsa0520: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

ref pingcap#58131

ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2025

elsa0520 marked this pull request as ready for review March 5, 2025 08:04

ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 5, 2025

0xPoe moved this to 🧐 Reviewing in 🎒My Work Mar 5, 2025

0xPoe added this to 🎒My Work Mar 5, 2025

0xPoe requested review from 0xPoe and Copilot March 5, 2025 09:01

Copilot AI reviewed Mar 5, 2025
