Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support skip/limit options for pandas scan #4662

Merged
merged 2 commits into from
Jan 7, 2025

Conversation

royi-luo
Copy link
Collaborator

Description

Add support for skipping rows + limiting number of rows to scan when scanning from pandas dataframes.
Also reuse PyarrowScanConfig for pandas scan config and rename to more appropriate name (since it isn't pyarrow only)

Contributor agreement

Sorry, something went wrong.

@royi-luo royi-luo self-assigned this Dec 20, 2024
Copy link

codecov bot commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.13%. Comparing base (0c6a394) to head (9741d88).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4662      +/-   ##
==========================================
- Coverage   86.13%   86.13%   -0.01%     
==========================================
  Files        1373     1373              
  Lines       58295    58295              
  Branches     7210     7211       +1     
==========================================
- Hits        50212    50210       -2     
- Misses       7919     7921       +2     
  Partials      164      164              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

Benchmark Result

Master commit hash: 86f1f4310edbd877f8db75800c6d1ac04d1e058d
Branch commit hash: 742d29ea897eb455fdf9635ba4a6462b7aa72d87

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 672.66 647.17 25.49 (3.94%)
aggregation q28 11565.36 11603.98 -38.62 (-0.33%)
filter q14 127.07 125.41 1.66 (1.32%)
filter q15 124.60 126.68 -2.08 (-1.64%)
filter q16 315.35 298.79 16.56 (5.54%)
filter q17 446.92 447.59 -0.67 (-0.15%)
filter q18 1898.34 1944.30 -45.96 (-2.36%)
filter zonemap-node 88.81 87.27 1.54 (1.77%)
filter zonemap-node-lhs-cast 88.21 87.44 0.78 (0.89%)
filter zonemap-node-null 84.34 N/A N/A
filter zonemap-rel 5687.44 5709.27 -21.83 (-0.38%)
fixed_size_expr_evaluator q07 571.07 581.07 -10.01 (-1.72%)
fixed_size_expr_evaluator q08 798.75 809.69 -10.94 (-1.35%)
fixed_size_expr_evaluator q09 802.63 810.57 -7.94 (-0.98%)
fixed_size_expr_evaluator q10 238.62 244.13 -5.51 (-2.26%)
fixed_size_expr_evaluator q11 230.56 235.72 -5.17 (-2.19%)
fixed_size_expr_evaluator q12 225.61 237.23 -11.62 (-4.90%)
fixed_size_expr_evaluator q13 1459.21 1456.84 2.37 (0.16%)
fixed_size_seq_scan q23 110.12 116.84 -6.72 (-5.75%)
join q29 584.25 637.69 -53.44 (-8.38%)
join q30 1568.08 1560.85 7.23 (0.46%)
join q31 6.12 5.07 1.05 (20.69%)
join SelectiveTwoHopJoin 52.90 56.33 -3.42 (-6.07%)
ldbc_snb_ic q35 2627.34 2642.46 -15.12 (-0.57%)
ldbc_snb_ic q36 538.30 537.18 1.12 (0.21%)
ldbc_snb_is q32 5.78 5.16 0.63 (12.12%)
ldbc_snb_is q33 14.87 15.62 -0.74 (-4.77%)
ldbc_snb_is q34 1.27 1.08 0.19 (17.94%)
multi-rel multi-rel-large-scan 1182.83 1234.16 -51.33 (-4.16%)
multi-rel multi-rel-lookup 31.99 21.75 10.24 (47.10%)
multi-rel multi-rel-small-scan 66.81 104.18 -37.37 (-35.87%)
order_by q25 128.39 131.89 -3.50 (-2.65%)
order_by q26 449.79 449.61 0.18 (0.04%)
order_by q27 1454.95 1474.94 -19.98 (-1.35%)
recursive_join recursive-join-bidirection 293.33 298.76 -5.43 (-1.82%)
recursive_join recursive-join-dense 7386.10 7364.18 21.92 (0.30%)
recursive_join recursive-join-path 23637.97 23804.54 -166.57 (-0.70%)
recursive_join recursive-join-sparse 14513.71 14748.68 -234.97 (-1.59%)
recursive_join recursive-join-trail 7292.17 7293.80 -1.64 (-0.02%)
scan_after_filter q01 170.71 169.83 0.88 (0.52%)
scan_after_filter q02 156.89 167.27 -10.38 (-6.21%)
shortest_path_ldbc100 q37 87.73 95.98 -8.25 (-8.60%)
shortest_path_ldbc100 q38 364.45 348.01 16.45 (4.73%)
shortest_path_ldbc100 q39 61.37 64.77 -3.40 (-5.25%)
shortest_path_ldbc100 q40 438.08 421.22 16.87 (4.00%)
var_size_expr_evaluator q03 2065.91 2078.72 -12.82 (-0.62%)
var_size_expr_evaluator q04 2199.12 2232.03 -32.91 (-1.47%)
var_size_expr_evaluator q05 2684.76 2651.46 33.30 (1.26%)
var_size_expr_evaluator q06 1342.34 1332.11 10.23 (0.77%)
var_size_seq_scan q19 1450.45 1457.11 -6.66 (-0.46%)
var_size_seq_scan q20 2693.46 2716.34 -22.88 (-0.84%)
var_size_seq_scan q21 2287.12 2292.11 -4.99 (-0.22%)
var_size_seq_scan q22 128.71 129.58 -0.87 (-0.67%)

@royi-luo royi-luo force-pushed the royi/add-skip-limit-pandas-scan branch 3 times, most recently from 7bf8f04 to 63d7fd1 Compare January 6, 2025 22:31
@royi-luo royi-luo force-pushed the royi/add-skip-limit-pandas-scan branch from 63d7fd1 to b709ea0 Compare January 6, 2025 22:32
Copy link

github-actions bot commented Jan 6, 2025

Benchmark Result

Master commit hash: 606eef8e2884694e3fa767045e5685b01daa62f0
Branch commit hash: 1183d889f82c9247651018e6ad405093eb0f58dc

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 654.30 662.73 -8.43 (-1.27%)
aggregation q28 11372.91 12127.88 -754.97 (-6.23%)
filter q14 134.93 131.13 3.80 (2.90%)
filter q15 134.17 126.91 7.26 (5.72%)
filter q16 314.09 311.89 2.21 (0.71%)
filter q17 454.68 455.26 -0.58 (-0.13%)
filter q18 1918.83 1907.55 11.27 (0.59%)
filter zonemap-node 97.55 89.66 7.89 (8.80%)
filter zonemap-node-lhs-cast 97.13 89.98 7.15 (7.95%)
filter zonemap-node-null 92.91 85.99 6.93 (8.05%)
filter zonemap-rel 5704.97 5492.81 212.16 (3.86%)
fixed_size_expr_evaluator q07 588.08 570.44 17.64 (3.09%)
fixed_size_expr_evaluator q08 818.99 799.48 19.50 (2.44%)
fixed_size_expr_evaluator q09 818.97 801.30 17.68 (2.21%)
fixed_size_expr_evaluator q10 383.33 236.33 147.00 (62.20%)
fixed_size_expr_evaluator q11 245.34 228.83 16.51 (7.21%)
fixed_size_expr_evaluator q12 242.12 225.84 16.28 (7.21%)
fixed_size_expr_evaluator q13 1469.73 1458.77 10.96 (0.75%)
fixed_size_seq_scan q23 127.25 113.50 13.74 (12.11%)
join q29 616.88 616.17 0.71 (0.12%)
join q30 10483.90 10279.23 204.68 (1.99%)
join q31 7.34 4.04 3.30 (81.71%)
join SelectiveTwoHopJoin 57.51 45.94 11.57 (25.18%)
ldbc_snb_ic q35 2531.04 2635.27 -104.24 (-3.96%)
ldbc_snb_ic q36 487.81 445.44 42.37 (9.51%)
ldbc_snb_is q32 6.55 5.54 1.01 (18.14%)
ldbc_snb_is q33 16.52 12.56 3.96 (31.52%)
ldbc_snb_is q34 1.35 0.85 0.49 (58.13%)
multi-rel multi-rel-large-scan 1362.50 1306.46 56.04 (4.29%)
multi-rel multi-rel-lookup 35.13 28.77 6.36 (22.11%)
multi-rel multi-rel-small-scan 76.41 81.08 -4.67 (-5.76%)
order_by q25 141.77 130.57 11.20 (8.58%)
order_by q26 461.36 459.94 1.42 (0.31%)
order_by q27 1472.88 1459.41 13.46 (0.92%)
recursive_join recursive-join-bidirection 325.75 272.92 52.83 (19.36%)
recursive_join recursive-join-dense 7348.79 7364.98 -16.19 (-0.22%)
recursive_join recursive-join-path 23974.56 23990.40 -15.85 (-0.07%)
recursive_join recursive-join-sparse 1056.29 1079.47 -23.17 (-2.15%)
recursive_join recursive-join-trail 7300.97 7297.36 3.61 (0.05%)
scan_after_filter q01 179.49 170.64 8.84 (5.18%)
scan_after_filter q02 166.43 157.36 9.07 (5.76%)
shortest_path_ldbc100 q37 89.99 82.64 7.35 (8.90%)
shortest_path_ldbc100 q38 318.84 369.53 -50.69 (-13.72%)
shortest_path_ldbc100 q39 62.57 62.24 0.33 (0.53%)
shortest_path_ldbc100 q40 430.70 461.72 -31.02 (-6.72%)
var_size_expr_evaluator q03 2082.42 2086.20 -3.77 (-0.18%)
var_size_expr_evaluator q04 2197.19 2170.41 26.78 (1.23%)
var_size_expr_evaluator q05 2617.25 2574.80 42.45 (1.65%)
var_size_expr_evaluator q06 1334.41 1322.89 11.52 (0.87%)
var_size_seq_scan q19 1453.92 1446.66 7.26 (0.50%)
var_size_seq_scan q20 2644.37 2358.30 286.07 (12.13%)
var_size_seq_scan q21 2273.58 2272.10 1.48 (0.07%)
var_size_seq_scan q22 128.98 124.85 4.14 (3.31%)

@royi-luo royi-luo marked this pull request as ready for review January 6, 2025 23:24
@royi-luo royi-luo requested a review from ray6080 January 6, 2025 23:24
Copy link
Contributor

@ray6080 ray6080 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Please check my comment below.

std::move(columnBindData), std::move(scanConfig));
auto scanConfig = PyScanConfig{
input->extraInput->constPtrCast<ExtraScanTableFuncBindInput>()->fileScanInfo.options};
KU_ASSERT(numRows >= scanConfig.skipNum);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm Do we have a check somewhere to ensure scanConfig.skipNum is always smaller than numRows? I'm concerned if this will lead to overflow under the case numRows < scanConfig.skipNum.

I would also add a test case on that if we don't have one already.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added check in scan config constructor that bounds skipNum to the number of rows + added tests

Copy link

github-actions bot commented Jan 7, 2025

Benchmark Result

Master commit hash: 0c6a394a32b9092e874b144516d9366a706e7d0b
Branch commit hash: 38e0c0f1307251c69cb8f21c5b5ff7f52dbce533

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 652.86 666.67 -13.81 (-2.07%)
aggregation q28 11997.19 11065.46 931.73 (8.42%)
filter q14 134.38 135.68 -1.30 (-0.95%)
filter q15 135.06 137.85 -2.78 (-2.02%)
filter q16 311.75 311.89 -0.14 (-0.05%)
filter q17 457.65 458.85 -1.21 (-0.26%)
filter q18 1930.44 1951.28 -20.84 (-1.07%)
filter zonemap-node 96.68 97.66 -0.98 (-1.01%)
filter zonemap-node-lhs-cast 96.86 99.19 -2.33 (-2.35%)
filter zonemap-node-null 93.39 95.14 -1.74 (-1.83%)
filter zonemap-rel 5697.17 5771.06 -73.90 (-1.28%)
fixed_size_expr_evaluator q07 588.26 598.74 -10.48 (-1.75%)
fixed_size_expr_evaluator q08 817.55 820.53 -2.99 (-0.36%)
fixed_size_expr_evaluator q09 817.24 830.88 -13.64 (-1.64%)
fixed_size_expr_evaluator q10 251.79 257.47 -5.68 (-2.21%)
fixed_size_expr_evaluator q11 245.09 251.51 -6.42 (-2.55%)
fixed_size_expr_evaluator q12 242.70 247.01 -4.32 (-1.75%)
fixed_size_expr_evaluator q13 1469.13 1487.80 -18.67 (-1.25%)
fixed_size_seq_scan q23 129.30 134.46 -5.16 (-3.84%)
join q29 614.88 618.88 -4.00 (-0.65%)
join q30 10280.57 10218.46 62.11 (0.61%)
join q31 5.00 8.03 -3.03 (-37.77%)
join SelectiveTwoHopJoin 54.56 53.38 1.18 (2.21%)
ldbc_snb_ic q35 2594.25 2584.14 10.11 (0.39%)
ldbc_snb_ic q36 456.79 437.49 19.30 (4.41%)
ldbc_snb_is q32 6.61 7.15 -0.54 (-7.57%)
ldbc_snb_is q33 16.33 17.29 -0.96 (-5.57%)
ldbc_snb_is q34 1.37 1.32 0.05 (3.74%)
multi-rel multi-rel-large-scan 1364.32 1370.98 -6.66 (-0.49%)
multi-rel multi-rel-lookup 11.69 21.75 -10.06 (-46.25%)
multi-rel multi-rel-small-scan 84.95 93.15 -8.21 (-8.81%)
order_by q25 138.81 140.08 -1.27 (-0.91%)
order_by q26 463.91 466.46 -2.54 (-0.55%)
order_by q27 1468.20 1488.59 -20.39 (-1.37%)
recursive_join recursive-join-bidirection 288.85 290.18 -1.33 (-0.46%)
recursive_join recursive-join-dense 7361.69 7491.91 -130.23 (-1.74%)
recursive_join recursive-join-path 23824.52 24141.69 -317.17 (-1.31%)
recursive_join recursive-join-sparse 1057.85 1075.83 -17.98 (-1.67%)
recursive_join recursive-join-trail 7279.69 7408.65 -128.96 (-1.74%)
scan_after_filter q01 179.75 183.67 -3.92 (-2.13%)
scan_after_filter q02 167.47 169.27 -1.81 (-1.07%)
shortest_path_ldbc100 q37 87.53 84.03 3.49 (4.16%)
shortest_path_ldbc100 q38 295.43 373.50 -78.07 (-20.90%)
shortest_path_ldbc100 q39 60.49 59.41 1.08 (1.81%)
shortest_path_ldbc100 q40 455.09 415.67 39.42 (9.48%)
var_size_expr_evaluator q03 2099.34 2096.91 2.44 (0.12%)
var_size_expr_evaluator q04 2223.25 2260.65 -37.40 (-1.65%)
var_size_expr_evaluator q05 2626.87 2665.41 -38.53 (-1.45%)
var_size_expr_evaluator q06 1332.58 1349.59 -17.01 (-1.26%)
var_size_seq_scan q19 1456.63 1483.26 -26.63 (-1.80%)
var_size_seq_scan q20 2644.62 2691.53 -46.91 (-1.74%)
var_size_seq_scan q21 2273.25 2302.03 -28.78 (-1.25%)
var_size_seq_scan q22 129.76 129.64 0.12 (0.09%)

@royi-luo royi-luo merged commit 6caaad7 into master Jan 7, 2025
23 of 25 checks passed
@royi-luo royi-luo deleted the royi/add-skip-limit-pandas-scan branch January 7, 2025 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants