Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support comparison semantics for batch serialize/deserialize of Column #9756

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

guo-shaoge
Copy link
Contributor

@guo-shaoge guo-shaoge commented Jan 1, 2025

What problem does this PR solve?

Issue Number: close #9761

Problem Summary: #9553 add serialize/deserialize interface by column-wise, but it didn't handle collator for ColumnString and real copy format for ColumnDecimal([sign, limb_count, limb_data])

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 1, 2025
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
This reverts commit c07d13a.
This reverts commit abd55ac.
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
Signed-off-by: guo-shaoge <[email protected]>
@guo-shaoge
Copy link
Contributor Author

/retest

Signed-off-by: guo-shaoge <[email protected]>
@@ -165,10 +165,27 @@ class ColumnAggregateFunction final : public COWPtrHelper<IColumn, ColumnAggrega

const char * deserializeAndInsertFromArena(const char * src_arena, const TiDB::TiDBCollatorPtr &) override;

void countSerializeByteSizeUnique(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to rename countSerializeByteSizeUnique to countSerializeUniqueByteSize for better readability?
Similarly, the name for other methods could follow the same pattern. E.g. serializeToPosUnique to serializeUniqueToPos and deserializeAndInsertFromPosUnique to deserializeUniqueAndInsertFromPos.

dbms/src/TiDB/Collation/Collator.cpp Outdated Show resolved Hide resolved
dbms/src/Columns/ColumnString.cpp Outdated Show resolved Hide resolved
dbms/src/Columns/ColumnString.cpp Show resolved Hide resolved
dbms/src/Columns/ColumnString.cpp Show resolved Hide resolved
}
else
{
inline_memcpy(pos[i], &data[array_offsets[start + i - 1]], len * sizeof(T));
if (len <= 4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some comments here to explain why is 4 here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not my code. @gengliqi can you help to explain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a simple optimization. If the length is very small, copying them one by one is faster than std::memcpy.

dbms/src/Columns/ColumnDecimal.cpp Show resolved Hide resolved
Signed-off-by: guo-shaoge <[email protected]>
@guo-shaoge guo-shaoge requested a review from yibin87 January 10, 2025 03:17
Signed-off-by: guo-shaoge <[email protected]>
Copy link
Contributor

@yibin87 yibin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

ti-chi-bot bot commented Jan 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yibin87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 10, 2025
Copy link
Contributor

ti-chi-bot bot commented Jan 10, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-01-10 05:57:26.14022174 +0000 UTC m=+505989.429053445: ☑️ agreed by yibin87.

@guo-shaoge guo-shaoge requested a review from gengliqi January 10, 2025 08:55
@guo-shaoge guo-shaoge changed the title support batch serialize/deserialize method for Column support unique family of batch serialize/deserialize of Column Jan 10, 2025
@guo-shaoge guo-shaoge changed the title support unique family of batch serialize/deserialize of Column support unique semantics of batch serialize/deserialize of Column Jan 10, 2025
@guo-shaoge guo-shaoge changed the title support unique semantics of batch serialize/deserialize of Column support unique semantics for batch serialize/deserialize of Column Jan 10, 2025
@guo-shaoge guo-shaoge changed the title support unique semantics for batch serialize/deserialize of Column support comparison semantics for batch serialize/deserialize of Column Jan 10, 2025
Signed-off-by: guo-shaoge <[email protected]>
}
else
{
inline_memcpy(pos[i], &data[array_offsets[start + i - 1]], len * sizeof(T));
if (len <= 4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a simple optimization. If the length is very small, copying them one by one is faster than std::memcpy.

{
assert(sizeAt(i) >= 1);
// Minus 1 because of terminating zero.
byte_size[i] += sizeof(UInt32) + (sizeAt(i) - 1) * max_bytes_one_char;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size * max_bytes_one_char may waste lots of memory. For example, a utf8 character is 3 bytes, max_bytes_one_char is 4 bytes. 4 bytes is enough but here need 12 bytes.

const void * src = &chars[offsetAt(start + i)];
if constexpr (has_collator)
{
auto sort_key = collator->sortKey(reinterpret_cast<const char *>(src), str_size - 1, *sort_key_container);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sortKey is a virtual function. How about adding a batch version to reduce the overhead of virtual functions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support serialize/deserialize by column-wise for collator
3 participants