-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] RowReaderImpl::next return inconsistant data in certain case #1512
Comments
Thank you for reporting, @chenpingzeng . cc @wgtmac , @stiga-huang |
Hi @chenpingzeng, https://github.com/apache/orc/pull/1469/files has discussed the same thing. BTW, the comment seems incorrect. I have opened https://github.com/apache/orc/pull/1515/files to fix it. |
Thanks for the advising pr. The key attension was to avoid using memset the notNull.data() to 1 for little bit performance negative effects.
I would like to share some experience of using result of orc::RowReader.next, not obvious performance improment in tpcds-99 test with 3TB data set when we directly copy orc::StructVectorBatch.fields[i] to dest obj memory, in case when hasNulls=0, bypass the unnessary checking of notNull.data()[i], sure it is meanningful to do this code refacting. So I think it is not a wasting of performance to ensure all values in notNull.data() are 1 when hasNulls=0.
On the other side, the problem for ‘miss use of reading notNull.data()‘ did exist after half a year until tpcds99 consistant checking very recently. As I mentioned in the issue, it was extremely difficult to figure out the condition to find out the problem data row, since over 8 billion records in table store_sales, or even more records in other scenes. That is say, from an expert or god view, sure it is user’s problem to read the notNull.data() when hasNull=0. Does any one has considered this question: do we have stop user stepping into this strap.?(Yes, I think it is a strap that notNull.data() has some 0 values when hasNull=0, it is a data inconsistent in my opinion)
发件人: Gang Wu ***@***.***>
发送时间: 2023年5月23日 15:32
收件人: apache/orc ***@***.***>
抄送: Chenpingzeng ***@***.***>; Mention ***@***.***>
主题: Re: [apache/orc] [C++] RowReaderImpl::next return inconsistant data in certain case (Issue #1512)
Hi @chenpingzeng<https://github.com/chenpingzeng>, https://github.com/apache/orc/pull/1469/files has discussed the same thing. Please check the comment below to see if it solves this issue.
https://github.com/apache/orc/blob/main/c%2B%2B/include/orc/Vector.hh#L40-L44
—
Reply to this email directly, view it on GitHub<#1512 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AN4CUN44RBRDXOYQCHSEJ7DXHRRYNANCNFSM6AAAAAAYKKRVYE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Background info:
In a spark project, we are using orc c++ as acceleration lib to access hdfs files, comparing to original spark table scan with java/scala code.
We found some tpcds-99 sql run fail with data inconstant, occasionally, by now SQL-9/SQL-28/SQL-47 encounters the problem in different clusters.
Appearance:
An example as below, 3TB data generated with tpcds-tool, when executing with 'select ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit, ss_sold_date_sk from store_sales where ss_item_sk=302314 and ss_customer_sk=11587351 and ss_quantity=24;',we got result as below:
The probleam here is ss_list_price/ss_ext_sales_price actual value is 15.81/7.44, but we get NULL
Analize Info:
It is difficult to add some debug info for a table with over 8 billion records in the issue. we first got a way to check the return data of RowReaderImpl::next and found that data.data() were correct value, but some notNull.data()[i] with 0.
we then add more debug code in orc c++ source code, below are log fragments:
a. at line 1132529 of log file, notNull.data() was init with all 1 at the beginning, here can see 4096 elements one time
b. at line 1133878 of log file, reading a new stripe for last_total=0, shows now we are using a new ColumReader which created in RowReaderImpl::startNextStripe()
c. in 1135312 of log file, shows the end of stripe reading
d. in 1135372 of log file, shows we are reading a new stripe which has 129 elements/records
the record matches condition 'where ss_item_sk=302314 and ss_customer_sk=11587351 and ss_quantity=24' was one of this 129 records, with a/b/c/d,we can make sure that notNull.data() should contain none 0 values, but it do have.
After more deeper annalize, the probem was found in ColumnReader::next, in case batch elements of last stripe include some NULL values, there were notNull.data()[i] with 0 at some positions.
When reading a new stripe which include no one NULL values, notNull.data() were not initialized, but with polluted values.
After simply add std::memset(notNull.data(), 1, numValues); for ColumnReader::next(ColumnVectorBatch& rowBatch, uint64_t numValues, char* incomingMask), the problem solved.
The text was updated successfully, but these errors were encountered: