forked from apache/spark
[SPARK-48682][SQL][FOLLOW-UP] Changed initCap behaviour with UTF8_BINARY collation

### What changes were proposed in this pull request?

This PR changes how Spark computes initCap under the UTF8_BINARY collation. initCap now titlecases the first character of every word and lowercases every other character. Words are separated only by ASCII space. Special care is taken when lowercasing Σ: if it is at the end of a word (ignoring case-ignorable characters) it is lowercased to ς, otherwise to σ. The current implementation already gets this right, because lowercasing the whole string takes care of it; in this PR it is handled manually because the lowercase function is no longer used.

The key difference in output that this PR introduces is:

| input | current_initCap(input) | new_initCap(input) |
|----------|----------|----------|
| İo | İo (I\u0307o) | İo |
| ß ﬁ ﬃ ﬀ ﬆ | ß ﬁ ﬃ ﬀ ﬆ | Ss Fi Ffi Ff St |

These are just some examples; many more mappings are affected. More details about the key changes are in the next section. This behaviour is put under the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf, which is true by default.

### Why are the changes needed?

The previous implementation first lowercases the complete string, and then titlecases the first character of every word [1]. When titlecasing the first character of every word, it maps a single codepoint to a single codepoint [2].

This leads to the following behaviour with respect to [1]:

| input | initCap(input) |
|----------|----------|
| İo | İo (I\u0307o) |

In summary, when the first character of a word (for example "İ") lowercases to more than one character (for example "i\u0307"), only the first character of that lowercased sequence ("i") is titlecased, rather than the complete sequence, because titlecasing is applied to a single character after the whole string has already been lowercased.

The behaviour that [2] produces is:

| input | initCap(input) |
|----------|----------|
| ß ﬁ ﬃ ﬀ ﬆ | ß ﬁ ﬃ ﬀ ﬆ |

While the expected output would probably be:

| input | initCap(input) |
|----------|----------|
| ß ﬁ ﬃ ﬀ ﬆ | Ss Fi Ffi Ff St |

Here ﬁ, ﬃ, ﬀ and ﬆ are single-codepoint ligatures. The titlecase of each of these characters is more than one character long, which [2] cannot represent. Again, these are just examples and not an exhaustive list of all the mappings that have been changed.

### Does this PR introduce _any_ user-facing change?

Yes, the InitCap expression will now return different results for:
- One-to-many case mappings (e.g. Turkish dotted I, ß, ﬁ)

### How was this patch tested?

Tests in CollationSupportSuite.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47771 from viktorluc-db/initCap.

Authored-by: viktorluc-db <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
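The two behaviours described in the commit message can be illustrated with a small, JDK-only sketch. This is not the code from the PR: `InitCapSketch`, `oldStyleInitCap`, `newStyleInitCap` and `approxFullTitleCase` are hypothetical names, and the one-to-many titlecase mapping is only approximated by taking the JDK's full uppercase mapping and lowercasing everything after its first code point. The flag name suggests the real implementation relies on ICU case mappings. Final-sigma handling is left out here.

```java
import java.util.Locale;

class InitCapSketch {

    // Previous behaviour as described above: lowercase the whole string first [1],
    // then apply a one-to-one titlecase mapping to each word's first code point [2].
    static String oldStyleInitCap(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder();
        boolean startOfWord = true;
        for (int i = 0; i < lower.length(); ) {
            int cp = lower.codePointAt(i);
            sb.appendCodePoint(startOfWord ? Character.toTitleCase(cp) : cp);
            startOfWord = (cp == ' '); // words are separated only by ASCII space
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // ASSUMPTION: approximates a full (one-to-many) titlecase mapping.
    // If the JDK has a simple titlecase mapping, use it; otherwise take the full
    // uppercase mapping and lowercase all but its first code point.
    static String approxFullTitleCase(int cp) {
        int simple = Character.toTitleCase(cp);
        if (simple != cp) {
            return new String(Character.toChars(simple));
        }
        String upper = new String(Character.toChars(cp)).toUpperCase(Locale.ROOT);
        int firstLen = Character.charCount(upper.codePointAt(0));
        return upper.substring(0, firstLen)
            + upper.substring(firstLen).toLowerCase(Locale.ROOT);
    }

    // New behaviour as described above: titlecase the first code point of every
    // word from the ORIGINAL string, lowercase every other code point.
    // Simplification: each code point is lowercased in isolation, so Σ always
    // becomes σ here (no final-sigma rule).
    static String newStyleInitCap(String s) {
        StringBuilder sb = new StringBuilder();
        boolean startOfWord = true;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String one = new String(Character.toChars(cp));
            sb.append(startOfWord ? approxFullTitleCase(cp) : one.toLowerCase(Locale.ROOT));
            startOfWord = (cp == ' ');
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dottedI = "\u0130o";                            // İo
        String ligatures = "\u00DF \uFB01 \uFB03 \uFB00 \uFB06"; // ß ﬁ ﬃ ﬀ ﬆ
        System.out.println(oldStyleInitCap(dottedI));
        System.out.println(newStyleInitCap(dottedI));
        System.out.println(oldStyleInitCap(ligatures));
        System.out.println(newStyleInitCap(ligatures));
    }
}
```

In the old-style path the `İ` has already been expanded to `i` + U+0307 before titlecasing, so only the `i` is titlecased. In the new-style path the original first code point is titlecased directly and the expansion never happens.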
1 parent c58148d, commit fb8d01a
Showing 6 changed files with 337 additions and 48 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
33 changes: 33 additions & 0 deletions
...on/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/SpecialCodePointConstants.java
@@ -0,0 +1,33 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.util;

/**
 * 'SpecialCodePointConstants' is introduced in order to keep the codepoints used in
 * 'CollationAwareUTF8String' in one place.
 */
public class SpecialCodePointConstants {

  public static final int COMBINING_DOT = 0x0307;
  public static final int ASCII_SMALL_I = 0x0069;
  public static final int ASCII_SPACE = 0x0020;
  public static final int GREEK_CAPITAL_SIGMA = 0x03A3;
  public static final int GREEK_SMALL_SIGMA = 0x03C3;
  public static final int GREEK_FINAL_SIGMA = 0x03C2;
  public static final int CAPITAL_I_WITH_DOT_ABOVE = 0x0130;
}
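To show how the sigma constants above might be used, here is a rough sketch of the rule from the commit message: Σ lowercases to ς at the end of a word (ignoring case-ignorable characters) and to σ otherwise. It is not the code in `CollationAwareUTF8String`. The class `SigmaSketch` and its helpers are hypothetical, the constant values are copied from the file above, and `isCased` / `isCaseIgnorable` are loose approximations of the Unicode properties built from JDK character categories.

```java
import java.util.Locale;

class SigmaSketch {
    // Values copied from SpecialCodePointConstants.
    static final int GREEK_CAPITAL_SIGMA = 0x03A3;
    static final int GREEK_SMALL_SIGMA = 0x03C3;
    static final int GREEK_FINAL_SIGMA = 0x03C2;

    // ASSUMPTION: rough stand-in for the Unicode "Cased" property.
    static boolean isCased(int cp) {
        return Character.isUpperCase(cp) || Character.isLowerCase(cp)
            || Character.isTitleCase(cp);
    }

    // ASSUMPTION: rough stand-in for "Case_Ignorable" (marks, format characters,
    // modifiers, and a few punctuation marks); the real property is larger.
    static boolean isCaseIgnorable(int cp) {
        switch (Character.getType(cp)) {
            case Character.NON_SPACING_MARK:
            case Character.ENCLOSING_MARK:
            case Character.FORMAT:
            case Character.MODIFIER_LETTER:
            case Character.MODIFIER_SYMBOL:
                return true;
            default:
                return cp == '\'' || cp == '.' || cp == ':' || cp == 0x00B7 || cp == 0x2019;
        }
    }

    // True if the nearest non-ignorable code point before index i is cased.
    static boolean hasCasedBefore(String w, int i) {
        while (i > 0) {
            int cp = w.codePointBefore(i);
            if (!isCaseIgnorable(cp)) return isCased(cp);
            i -= Character.charCount(cp);
        }
        return false;
    }

    // True if the nearest non-ignorable code point after the one at index i is cased.
    static boolean hasCasedAfter(String w, int i) {
        i += Character.charCount(w.codePointAt(i));
        while (i < w.length()) {
            int cp = w.codePointAt(i);
            if (!isCaseIgnorable(cp)) return isCased(cp);
            i += Character.charCount(cp);
        }
        return false;
    }

    // Lowercases a single word (no spaces), choosing ς or σ for each Σ by context.
    static String lowerWord(String word) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            if (cp == GREEK_CAPITAL_SIGMA) {
                boolean isFinal = hasCasedBefore(word, i) && !hasCasedAfter(word, i);
                sb.appendCodePoint(isFinal ? GREEK_FINAL_SIGMA : GREEK_SMALL_SIGMA);
            } else {
                sb.append(new String(Character.toChars(cp)).toLowerCase(Locale.ROOT));
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowerWord("\u03A3\u0391\u03A3")); // ΣΑΣ
    }
}
```

A lone Σ has no cased letter before it, so it maps to σ rather than ς. A trailing period is skipped as case-ignorable, so the Σ in "ΑΣ." is still treated as word-final.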