fix(ut,orc): fix orc read ut under debian and adjust some options #37
@lucasfang @lxy-9602 @ChaomingZhangCN PTAL, thanks!
CI reports errors as follows:
Problem solved.
Pull request overview
This PR addresses ORC timestamp reading issues caused by timezone data differences between Debian and other Linux distributions, specifically for timestamps prior to 1901 when using the Asia/Shanghai timezone.
Key changes:
- Sets the ORC reader timezone to GMT to avoid timezone conversion issues during reading
- Adds OS detection utility to handle different expected test values on Debian vs other platforms
- Updates test expectations to account for the 5 minutes and 43 seconds timezone offset present in Debian's timezone data
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/paimon/format/orc/orc_file_batch_reader.cpp | Sets ORC reader timezone to GMT to prevent timezone conversion errors |
| src/paimon/testing/utils/testharness.h | Adds OsReleaseDetector utility class to detect Debian OS |
| test/inte/read_inte_test.cpp | Adds conditional test expectations for Debian vs non-Debian platforms; includes unintentional whitespace change |
| src/paimon/format/orc/orc_file_batch_reader_test.cpp | Adds conditional test expectations for Debian vs non-Debian platforms |
| src/paimon/format/orc/complex_predicate_test.cpp | Adds conditional test expectations and removes unused field declarations from SetUp |
@lucasfang PTAL and merge it.
Purpose
Linked issue: close #xxx
To verify, I added debug logs to the ORC format code, focusing on the timestamp time zone conversion. The details are as follows:
In the Debian environment, I pointed the TZDIR environment variable first at Ubuntu 24.04's tzdata (wget http://security.ubuntu.com/ubuntu/pool/main/t/tzdata/tzdata_2025b-0ubuntu0.24.04.1_all.deb) and then at Debian's own tzdata. The results are as follows.
Debian:
Ubuntu 24.04:
Analysis
The comparison shows that Debian's tzdata contains an extra LMT (local mean time) entry for this time zone, which ultimately shifts the final result by 5 minutes and 43 seconds.
Root Cause of the Issue: a bug in ORC Timestamp conversion
First, it is essential to understand how ORC stores Timestamps:
During Writing
The writer stores each timestamp as an offset in seconds relative to epoch_.
Calculation of epoch_
The critical issue here is that epoch_ is calculated using the time zone offset in effect on 2015-01-01.
Core Conflict
`writerTime = secsBuffer[i] + epochOffset_;`

This formula assumes that secsBuffer[i] is an offset relative to a "local time epoch", yet that "local time epoch" is computed with the time zone rules in effect on 2015-01-01. Errors occur when the actual time point (epoch + secsBuffer[i]) falls under different time zone rules.
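The size of the resulting error can be checked with plain arithmetic. The sketch below is not ORC code: the helper names `LocalToUtc` and `ConversionSkew` and both constants are hypothetical, and it assumes Asia/Shanghai uses +08:00 on 2015-01-01 and an LMT offset of +08:05:43 for early dates in a tzdata build that carries the LMT entry.

```cpp
#include <cstdint>

// Assumed UTC offsets for Asia/Shanghai, in seconds (illustrative constants, not read from tzdata).
constexpr std::int64_t kOffsetAt2015 = 8 * 3600;             // +08:00, rule in effect on 2015-01-01
constexpr std::int64_t kLmtOffset = 8 * 3600 + 5 * 60 + 43;  // +08:05:43, LMT entry for early dates

// Hypothetical helper: converts local wall-clock seconds to UTC seconds for a given UTC offset.
constexpr std::int64_t LocalToUtc(std::int64_t local_seconds, std::int64_t utc_offset) {
    return local_seconds - utc_offset;
}

// Hypothetical helper: the error introduced when a local time is resolved with the
// fixed 2015 offset although offset_at_instant is the rule that actually applies.
constexpr std::int64_t ConversionSkew(std::int64_t local_seconds, std::int64_t offset_at_instant) {
    return LocalToUtc(local_seconds, kOffsetAt2015) - LocalToUtc(local_seconds, offset_at_instant);
}

// The skew equals the difference between the two offsets: 343 s, i.e. 5 min 43 s.
static_assert(ConversionSkew(0, kLmtOffset) == 343, "skew should be 5 min 43 s");
```

When the rule at the actual instant matches the 2015 rule, `ConversionSkew` returns 0, which is consistent with the report that the discrepancy only appears for timestamps old enough to fall under the LMT entry.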
Fix
For the time being, I have worked around this issue by detecting the OS and adjusting the expected test values accordingly. To resolve the problem at its root, the writer time zone of the ORC file should be set to UTC/GMT when data is written, which avoids similar issues.
Tests
API and Format
Documentation