support comparison semantics for batch serialize/deserialize of Column #9756

guo-shaoge · 2025-01-01T09:57:39Z

What problem does this PR solve?

Issue Number: close #9761

Problem Summary: #9553 added a column-wise serialize/deserialize interface, but it did not handle the collator for ColumnString or the real copy format for ColumnDecimal ([sign, limb_count, limb_data]).
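
To illustrate why the collator matters for comparison semantics: if serialized bytes are compared directly, two strings that are equal under a collation need to serialize to identical bytes. The following is a minimal sketch, not TiFlash code. `toySortKey` and `serializeForCompare` are hypothetical names, and ASCII case folding stands in for a real collator's sort key. The `[UInt32 length][payload]` layout is assumed from the `sizeof(UInt32)` term in the ColumnString diff.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for a collator's sort key: ASCII case folding.
std::string toySortKey(const std::string & s)
{
    std::string key = s;
    for (char & c : key)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return key;
}

// Serialize as [UInt32 length][payload]. With use_collator set, the payload is
// the sort key, so strings equal under the collation give byte-identical output.
std::string serializeForCompare(const std::string & s, bool use_collator)
{
    const std::string payload = use_collator ? toySortKey(s) : s;
    const auto len = static_cast<std::uint32_t>(payload.size());
    std::string out(sizeof(len) + payload.size(), '\0');
    std::memcpy(out.data(), &len, sizeof(len));
    std::memcpy(out.data() + sizeof(len), payload.data(), payload.size());
    return out;
}
```

Without the collator, "abc" and "ABC" serialize to different bytes even though a case-insensitive collation treats them as equal.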

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

guo-shaoge · 2025-01-07T02:33:22Z

/retest

gengliqi · 2025-01-08T09:37:01Z

dbms/src/Columns/ColumnAggregateFunction.h

@@ -165,10 +165,27 @@ class ColumnAggregateFunction final : public COWPtrHelper<IColumn, ColumnAggrega

    const char * deserializeAndInsertFromArena(const char * src_arena, const TiDB::TiDBCollatorPtr &) override;

+    void countSerializeByteSizeUnique(


Would it make more sense to rename countSerializeByteSizeUnique to countSerializeUniqueByteSize for better readability?
Similarly, the names of the other methods could follow the same pattern, e.g. serializeToPosUnique to serializeUniqueToPos and deserializeAndInsertFromPosUnique to deserializeUniqueAndInsertFromPos.

dbms/src/TiDB/Collation/Collator.cpp

dbms/src/Columns/ColumnString.cpp

yibin87 · 2025-01-09T05:16:28Z

dbms/src/Columns/ColumnDecimal.cpp

        }
        else
        {
-            inline_memcpy(pos[i], &data[array_offsets[start + i - 1]], len * sizeof(T));
+            if (len <= 4)


Please add a comment here explaining why the threshold is 4.

It's not my code. @gengliqi can you help explain?

It's just a simple optimization. If the length is very small, copying them one by one is faster than std::memcpy.
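
The optimization described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `copyElements` and `small_copy_threshold` are hypothetical names, the sketch copies typed elements rather than raw bytes into `pos[i]`, and the value 4 is taken from the diff without any claim that it is optimal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Threshold at or below which an element-wise loop replaces memcpy.
// The value 4 mirrors the diff above; treat it as an assumption.
constexpr std::size_t small_copy_threshold = 4;

template <typename T>
void copyElements(T * dst, const T * src, std::size_t len)
{
    static_assert(std::is_trivially_copyable_v<T>, "memcpy path requires trivially copyable T");
    if (len <= small_copy_threshold)
    {
        // Few elements: a short loop avoids the call overhead of memcpy.
        for (std::size_t i = 0; i < len; ++i)
            dst[i] = src[i];
    }
    else
    {
        // Larger runs: let memcpy handle the bulk copy.
        std::memcpy(dst, src, len * sizeof(T));
    }
}
```

Whether the loop actually beats `std::memcpy` depends on the compiler and target, so a threshold like this is best confirmed with a benchmark.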

dbms/src/Columns/ColumnDecimal.cpp

yibin87

LGTM

ti-chi-bot · 2025-01-10T05:57:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yibin87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [yibin87]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2025-01-10T05:57:26Z

[LGTM Timeline notifier]

Timeline:

2025-01-10 05:57:26.14022174 +0000 UTC m=+505989.429053445: ☑️ agreed by yibin87.

dbms/src/Columns/ColumnArray.h

gengliqi · 2025-01-10T09:15:12Z

dbms/src/Columns/ColumnDecimal.cpp

        }
        else
        {
-            inline_memcpy(pos[i], &data[array_offsets[start + i - 1]], len * sizeof(T));
+            if (len <= 4)


It's just a simple optimization. If the length is very small, copying them one by one is faster than std::memcpy.

gengliqi · 2025-01-10T09:52:17Z

dbms/src/Columns/ColumnString.cpp

+        {
+            assert(sizeAt(i) >= 1);
+            // Minus 1 because of terminating zero.
+            byte_size[i] += sizeof(UInt32) + (sizeAt(i) - 1) * max_bytes_one_char;


Reserving size * max_bytes_one_char may waste a lot of memory. For example, take a single UTF-8 character that occupies 3 bytes, with max_bytes_one_char equal to 4: 4 bytes would be enough, but this reserves 3 * 4 = 12 bytes.
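
One way to tighten the estimate is to multiply by the number of characters instead of the number of bytes. The sketch below is only an illustration: `countUtf8Chars`, `boundByBytes`, and `boundByChars` are hypothetical helpers, the input is assumed to be valid UTF-8, and the sort key is assumed to need at most `max_bytes_one_char` bytes per character.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Count UTF-8 code points by counting bytes that are not continuation
// bytes (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t countUtf8Chars(std::string_view s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        n += (c & 0xC0) != 0x80;
    return n;
}

// Bound as in the diff: one maximum-width key per input byte.
std::size_t boundByBytes(std::string_view s, std::size_t max_bytes_one_char)
{
    return s.size() * max_bytes_one_char;
}

// Tighter bound: one maximum-width key per character.
std::size_t boundByChars(std::string_view s, std::size_t max_bytes_one_char)
{
    return countUtf8Chars(s) * max_bytes_one_char;
}
```

For a single 3-byte character and `max_bytes_one_char` of 4, the byte-based bound is 12 while the character-based bound is 4. The tighter bound costs an extra pass over the string, so it trades CPU for memory.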

gengliqi · 2025-01-10T09:54:22Z

dbms/src/Columns/ColumnString.cpp

+        const void * src = &chars[offsetAt(start + i)];
+        if constexpr (has_collator)
+        {
+            auto sort_key = collator->sortKey(reinterpret_cast<const char *>(src), str_size - 1, *sort_key_container);


The sortKey is a virtual function. How about adding a batch version to reduce the overhead of virtual functions?
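
A batch interface along these lines could look like the sketch below. All names here (`ToyCollator`, `ToyUpperCollator`, `sortKeyBatch`) are hypothetical and do not reflect TiFlash's actual collator API. The point is only that the virtual dispatch happens once per batch, while the per-row work goes through a non-virtual helper.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct ToyCollator
{
    virtual ~ToyCollator() = default;
    // Per-row interface: one virtual call per string.
    virtual std::string sortKey(const std::string & s) const = 0;
    // Hypothetical batch interface: one virtual call per batch.
    virtual void sortKeyBatch(const std::vector<std::string> & in, std::vector<std::string> & out) const = 0;
};

struct ToyUpperCollator final : ToyCollator
{
    std::string sortKey(const std::string & s) const override { return sortKeyImpl(s); }

    void sortKeyBatch(const std::vector<std::string> & in, std::vector<std::string> & out) const override
    {
        out.clear();
        out.reserve(in.size());
        for (const auto & s : in)
            out.push_back(sortKeyImpl(s)); // non-virtual call inside the loop
    }

private:
    // Toy sort key: ASCII case folding, standing in for a real collation.
    static std::string sortKeyImpl(const std::string & s)
    {
        std::string key = s;
        for (char & c : key)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return key;
    }
};
```

A caller holding only a `const ToyCollator &` pays one virtual call for the whole batch instead of one per row.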

guo-shaoge added 5 commits December 31, 2024 12:47

basically done

e4b57c8

Signed-off-by: guo-shaoge <[email protected]>

fix compilation

1574825

Signed-off-by: guo-shaoge <[email protected]>

fmt

4203af0

Signed-off-by: guo-shaoge <[email protected]>

compile && nt_optimization

29021b2

Signed-off-by: guo-shaoge <[email protected]>

unit test

a3cd638

Signed-off-by: guo-shaoge <[email protected]>

ti-chi-bot bot added the do-not-merge/needs-linked-issue, release-note-none (denotes a PR that doesn't merit a release note), and size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) labels Jan 1, 2025

guo-shaoge added 2 commits January 1, 2025 17:58

refine

d155847

Signed-off-by: guo-shaoge <[email protected]>

fix

d3b0300

Signed-off-by: guo-shaoge <[email protected]>

ti-chi-bot bot removed the do-not-merge/needs-linked-issue label Jan 2, 2025

guo-shaoge requested a review from gengliqi January 2, 2025 09:33

guo-shaoge added 3 commits January 2, 2025 19:17

Merge branch 'master' into batch_serialize

e8564aa

test new impl

abd55ac

Signed-off-by: guo-shaoge <[email protected]>

Merge branch 'batch_serialize' of github.com:guo-shaoge/tiflash into …

4cac26a

…batch_serialize

guo-shaoge force-pushed the batch_serialize branch from a311faa to eeb1ac3 January 3, 2025 04:21

test ci impl

c07d13a

Signed-off-by: guo-shaoge <[email protected]>

guo-shaoge force-pushed the batch_serialize branch from eeb1ac3 to c07d13a January 3, 2025 04:58

guo-shaoge added 2 commits January 6, 2025 15:50

Revert "test ci impl"

086b630

This reverts commit c07d13a.

Revert "test new impl"

db8d490

This reverts commit abd55ac.

guo-shaoge force-pushed the batch_serialize branch from e19c851 to 7c725cb January 6, 2025 08:18

change name

84ee65b

Signed-off-by: guo-shaoge <[email protected]>

guo-shaoge force-pushed the batch_serialize branch from 7c725cb to 84ee65b January 6, 2025 08:36

guo-shaoge added 3 commits January 6, 2025 16:48

is_fast -> ensure_unique

3800d0f

Signed-off-by: guo-shaoge <[email protected]>

batchSerializeImpl -> serializeToPosImpl

a6fac1f

Signed-off-by: guo-shaoge <[email protected]>

ci

19982d3

Signed-off-by: guo-shaoge <[email protected]>

refine

47cdf91

Signed-off-by: guo-shaoge <[email protected]>

guo-shaoge force-pushed the batch_serialize branch from 3878b4b to 47cdf91 January 7, 2025 03:04

guo-shaoge requested a review from yibin87 January 7, 2025 03:46

Merge branch 'master' of github.com:pingcap/tiflash into batch_serialize

1342f6a

gengliqi reviewed Jan 8, 2025

yibin87 reviewed Jan 8, 2025

dbms/src/TiDB/Collation/Collator.cpp (outdated, resolved)

dbms/src/Columns/ColumnString.cpp (outdated, resolved)

yibin87 reviewed Jan 9, 2025

refine

2a6a5f5

Signed-off-by: guo-shaoge <[email protected]>

guo-shaoge requested a review from yibin87 January 10, 2025 03:17

refine

7d910e5

Signed-off-by: guo-shaoge <[email protected]>

yibin87 approved these changes Jan 10, 2025

ti-chi-bot bot added the approved and needs-1-more-lgtm (indicates a PR needs 1 more LGTM) labels Jan 10, 2025

gengliqi reviewed Jan 10, 2025

dbms/src/Columns/ColumnArray.h (outdated, resolved)

guo-shaoge requested a review from gengliqi January 10, 2025 08:55

guo-shaoge changed the title ~~support batch serialize/deserialize method for Column~~ support unique family of batch serialize/deserialize of Column Jan 10, 2025

guo-shaoge changed the title ~~support unique family of batch serialize/deserialize of Column~~ support unique semantics of batch serialize/deserialize of Column Jan 10, 2025

guo-shaoge changed the title ~~support unique semantics of batch serialize/deserialize of Column~~ support unique semantics for batch serialize/deserialize of Column Jan 10, 2025

guo-shaoge force-pushed the batch_serialize branch from 38e03ed to 8070321 January 10, 2025 09:16

guo-shaoge changed the title ~~support unique semantics for batch serialize/deserialize of Column~~ support comparison semantics for batch serialize/deserialize of Column Jan 10, 2025

guo-shaoge force-pushed the batch_serialize branch from 8070321 to 06ff1d4 January 10, 2025 09:20

refine

7572680

Signed-off-by: guo-shaoge <[email protected]>

guo-shaoge force-pushed the batch_serialize branch from 06ff1d4 to 7572680 January 10, 2025 09:27

gengliqi reviewed Jan 10, 2025
