[fix](parquet) impl has_dict_page to replace old logic #45740

Open
wants to merge 2 commits into
base: master

Contributor

@suxiaogang223 suxiaogang223 commented Dec 20, 2024

What problem does this PR solve?

Problem Summary:

This PR implements has_dict_page, which checks whether a given column has a dictionary page, and uses it to replace the old logic.

The function decides based on the dictionary_page_offset field in the column metadata: a dictionary page is present only if dictionary_page_offset is set, is greater than 0, and is less than data_page_offset.

These checks follow the behavior of the Java implementation of Parquet, where a dictionary_page_offset of 0 indicates that there is no dictionary. In addition, Parquet may write an empty row group; the dictionary page content is then empty, so the dictionary page should not be read.

See https://github.com/apache/arrow/pull/2667/files
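
For reviewers, the check described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `ColumnMeta` and its `has_dictionary_page_offset` flag are hypothetical stand-ins for the Thrift-generated column metadata and its "is set" marker.

```cpp
#include <cstdint>

// Hypothetical stand-in for the Parquet column metadata (assumption, not the real type).
struct ColumnMeta {
    bool has_dictionary_page_offset;       // whether the optional field was set
    std::int64_t dictionary_page_offset;   // file offset of the dictionary page
    std::int64_t data_page_offset;         // file offset of the first data page
};

// A dictionary page is considered present only if the offset is set,
// greater than 0, and located before the first data page.
constexpr bool has_dict_page(const ColumnMeta& column) {
    return column.has_dictionary_page_offset &&
           column.dictionary_page_offset > 0 &&
           column.dictionary_page_offset < column.data_page_offset;
}
```

With this sketch, a column with dictionary_page_offset = 4 and data_page_offset = 100 is reported as having a dictionary page, while an unset offset, an offset of 0, or an offset at or after data_page_offset is not.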

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 suxiaogang223 marked this pull request as draft December 20, 2024 10:15
@suxiaogang223 suxiaogang223 marked this pull request as ready for review December 22, 2024 09:45
@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 39714 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 31a3783edfe62caaaa3b54e377830c644ad68553, data reload: false

------ Round 1 ----------------------------------
q1	17605	7462	7256	7256
q2	2050	183	174	174
q3	10854	1122	1175	1122
q4	10549	766	697	697
q5	7616	2764	2738	2738
q6	241	148	145	145
q7	967	626	610	610
q8	9241	1834	1941	1834
q9	6609	6469	6433	6433
q10	7051	2290	2317	2290
q11	478	256	255	255
q12	440	214	220	214
q13	18067	3046	2906	2906
q14	250	207	210	207
q15	563	529	488	488
q16	668	588	580	580
q17	962	536	492	492
q18	7324	6746	6674	6674
q19	1331	1014	1082	1014
q20	466	186	180	180
q21	3999	3106	3282	3106
q22	384	317	299	299
Total cold run time: 107715 ms
Total hot run time: 39714 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7256	7202	7261	7202
q2	337	226	233	226
q3	2894	2812	2909	2812
q4	2059	1848	1792	1792
q5	5680	5640	5649	5640
q6	233	143	143	143
q7	2247	1856	1823	1823
q8	3408	3539	3504	3504
q9	8838	9076	8956	8956
q10	3608	3533	3548	3533
q11	596	504	542	504
q12	821	646	599	599
q13	11402	3138	3119	3119
q14	311	306	276	276
q15	554	504	524	504
q16	695	630	640	630
q17	1857	1665	1594	1594
q18	8413	7866	7734	7734
q19	1735	1435	1503	1435
q20	2076	1893	1891	1891
q21	5565	5506	5374	5374
q22	666	617	589	589
Total cold run time: 71251 ms
Total hot run time: 59880 ms

@doris-robot

TPC-DS: Total hot run time: 196936 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 31a3783edfe62caaaa3b54e377830c644ad68553, data reload: false

query1	1317	930	929	929
query2	6241	2432	2319	2319
query3	10962	4755	4690	4690
query4	32975	23393	23458	23393
query5	4048	456	444	444
query6	298	183	182	182
query7	3987	308	317	308
query8	313	263	237	237
query9	9641	2711	2711	2711
query10	460	255	253	253
query11	17732	14998	15000	14998
query12	155	109	105	105
query13	1599	426	414	414
query14	9386	6780	7809	6780
query15	299	200	218	200
query16	8200	420	426	420
query17	1778	612	560	560
query18	2175	314	329	314
query19	364	160	159	159
query20	118	120	116	116
query21	202	104	107	104
query22	4770	4572	4668	4572
query23	34844	34296	34086	34086
query24	10316	2684	2527	2527
query25	629	413	390	390
query26	1276	162	151	151
query27	2340	339	334	334
query28	7515	2485	2434	2434
query29	879	415	420	415
query30	220	149	150	149
query31	1036	831	801	801
query32	98	61	58	58
query33	772	299	307	299
query34	1156	531	537	531
query35	913	760	770	760
query36	1128	962	937	937
query37	141	77	76	76
query38	4246	4182	4249	4182
query39	1490	1484	1443	1443
query40	207	104	98	98
query41	45	40	43	40
query42	114	108	104	104
query43	551	499	507	499
query44	1348	856	853	853
query45	196	176	172	172
query46	1221	750	727	727
query47	2043	1947	1966	1947
query48	466	345	325	325
query49	942	406	380	380
query50	835	394	396	394
query51	7437	7126	7172	7126
query52	108	92	93	92
query53	271	188	184	184
query54	1136	414	423	414
query55	84	84	85	84
query56	273	291	254	254
query57	1290	1135	1168	1135
query58	251	235	215	215
query59	3357	3343	3293	3293
query60	276	247	251	247
query61	110	126	139	126
query62	885	699	722	699
query63	234	197	197	197
query64	3961	749	657	657
query65	3312	3238	3300	3238
query66	777	311	305	305
query67	16590	15601	15713	15601
query68	5386	562	569	562
query69	506	259	252	252
query70	1224	1153	1151	1151
query71	473	257	250	250
query72	7153	4129	4134	4129
query73	807	360	365	360
query74	10246	8881	8828	8828
query75	3712	2644	2631	2631
query76	3752	1001	1173	1001
query77	581	276	277	276
query78	10337	9423	9394	9394
query79	1939	608	609	608
query80	789	429	412	412
query81	517	229	234	229
query82	691	116	117	116
query83	163	147	146	146
query84	246	68	75	68
query85	1334	310	305	305
query86	399	309	300	300
query87	4493	4660	4344	4344
query88	3646	2253	2215	2215
query89	436	302	286	286
query90	1973	190	185	185
query91	140	106	107	106
query92	64	52	50	50
query93	1936	549	542	542
query94	760	289	281	281
query95	352	249	248	248
query96	638	284	285	284
query97	2877	2670	2699	2670
query98	218	204	202	202
query99	1607	1318	1365	1318
Total cold run time: 303734 ms
Total hot run time: 196936 ms

@doris-robot

ClickBench: Total hot run time: 32.5 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 31a3783edfe62caaaa3b54e377830c644ad68553, data reload: false

query1	0.03	0.03	0.03
query2	0.09	0.03	0.03
query3	0.23	0.07	0.07
query4	1.60	0.10	0.10
query5	0.44	0.42	0.41
query6	1.16	0.64	0.67
query7	0.02	0.02	0.02
query8	0.04	0.04	0.04
query9	0.57	0.50	0.49
query10	0.57	0.57	0.56
query11	0.13	0.10	0.10
query12	0.14	0.11	0.10
query13	0.60	0.60	0.61
query14	2.87	2.86	2.72
query15	0.90	0.83	0.83
query16	0.39	0.40	0.38
query17	1.02	0.99	1.06
query18	0.23	0.21	0.20
query19	1.86	1.83	1.94
query20	0.01	0.01	0.01
query21	15.36	0.55	0.57
query22	2.89	2.54	1.84
query23	16.88	1.10	0.69
query24	3.12	1.03	1.60
query25	0.29	0.20	0.07
query26	0.48	0.14	0.14
query27	0.05	0.04	0.03
query28	10.24	1.10	1.10
query29	12.59	3.19	3.17
query30	0.25	0.06	0.06
query31	2.85	0.40	0.39
query32	3.23	0.46	0.47
query33	3.07	3.22	3.16
query34	17.07	4.49	4.44
query35	4.48	4.44	4.49
query36	0.68	0.51	0.48
query37	0.09	0.06	0.06
query38	0.04	0.04	0.03
query39	0.03	0.02	0.03
query40	0.16	0.12	0.12
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.91 s
Total hot run time: 32.5 s

3 participants