Release plan 3.4 #52573

Open
jaogoy opened this issue Nov 4, 2024 · 0 comments
Labels
type/enhancement Make an enhancement to StarRocks type/feature-request

Comments

Contributor

jaogoy commented Nov 4, 2024

ETA: December 2024

Shared-data Enhancements

  1. Provides a Query cache, aligned with the Shared-nothing architecture.
  2. Supports synchronous materialized views, aligned with the Shared-nothing architecture.

Data Lake Analytics

  1. Optimizes Iceberg V2 queries by reducing repeated reads of delete files, which lowers memory usage and improves query performance.
  2. Provides Time Travel query capability for Iceberg, allowing data to be read from a specified BRANCH or TAG by specifying TIMESTAMP or VERSION.
  3. Supports Delta Lake column mapping.
  4. Data Cache related improvements:
    1. Introduces a Segmented LRU (SLRU) cache eviction strategy, which defends against cache pollution from occasional large queries, improving the cache hit rate and reducing fluctuations in query performance.
    2. Unifies parameters for Data Cache in both Shared-data architecture and lake query scenarios.
    3. Provides an Adaptive IO strategy optimization for Data Cache, which adaptively routes some query requests to remote storage based on the cache disk's load and performance, thereby enhancing overall access throughput.
  5. Enables asynchronous delivery of query fragments for lake queries. This relaxes the restriction that FE must obtain all files to be queried before BE can execute the query, so FE can fetch query files while BE executes in parallel, shortening the overall latency of lake queries that involve a large number of files. (Currently, the optimization is complete for Hudi and Delta Lake; Iceberg is not yet done.)
  6. Supports automatic collection of external table statistics, which can collect more accurate NDV information compared to metadata files, thereby optimizing the query plan and improving query performance.
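
The Segmented LRU idea mentioned under the Data Cache improvements can be sketched as follows. This is a minimal, generic illustration of SLRU, not StarRocks' actual Data Cache implementation; the class name `SLRUCache` and the two capacity parameters are assumptions made for the example. New entries land in a probationary segment, and only entries that are accessed again move to a protected segment, so a one-off large scan can evict probationary entries but not the hot ones.

```python
from collections import OrderedDict


class SLRUCache:
    """Toy Segmented LRU cache (hypothetical sketch, not StarRocks code)."""

    def __init__(self, probation_cap, protected_cap):
        self.probation_cap = probation_cap
        self.protected_cap = protected_cap
        self.probation = OrderedDict()  # entries seen once
        self.protected = OrderedDict()  # entries seen at least twice

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)  # refresh recency
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)  # second access: promote
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            self.probation.pop(key)
            self._promote(key, value)
        else:
            self.probation[key] = value
            self._trim_probation()

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_cap:
            # Demote the least recently used protected entry to probation.
            old_key, old_value = self.protected.popitem(last=False)
            self.probation[old_key] = old_value
            self._trim_probation()

    def _trim_probation(self):
        while len(self.probation) > self.probation_cap:
            self.probation.popitem(last=False)  # evict oldest probationary entry


cache = SLRUCache(probation_cap=2, protected_cap=2)
cache.put("hot", 1)
cache.get("hot")              # second access promotes "hot" to the protected segment
for i in range(10):           # a one-off scan only churns the probationary segment
    cache.put(f"scan{i}", i)
print(cache.get("hot"))       # "hot" is still cached
```

With a plain LRU of the same total size, the ten scan entries would have evicted `hot`; here they only displace each other.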

Performance Improvement and Query Optimization

  1. Provides Arrow Flight interface for more efficient reading of large data volumes in query results.
  2. [Experimental] Offers a preliminary query feedback feature for automatic optimization of slow queries. The system collects slow queries, automatically analyzes each query's plan for potential optimization needs based on execution details, and may generate a tailored optimization guide. If the optimizer generates the same bad plan for subsequent identical queries, the system may locally optimize this query plan in an attempt to generate a better one.
  3. Enables the pushdown of multi-column OR predicates, allowing a query with multi-column OR conditions (e.g., a = xxx OR b = yyy) to utilize certain column indexes, thus reducing data read volume and improving query performance.
  4. Further optimizes query performance for TPC-DS, with a roughly 30% performance improvement in TPC-DS 1TB Iceberg queries.
  5. Supports Python UDFs, offering more convenient function customization compared to Java UDFs.
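
To illustrate why pushing down a multi-column OR predicate can reduce the data read, here is a toy sketch. It is not how StarRocks implements the feature: the in-memory `rows`, the `build_index` helper, and the `or_lookup` function are all hypothetical. The point is only that `a = x OR b = y` can be answered as the union of two per-column index lookups instead of a scan over every row.

```python
from collections import defaultdict

# Hypothetical table contents, for illustration only.
rows = [
    {"a": 1, "b": "x"},
    {"a": 2, "b": "y"},
    {"a": 1, "b": "z"},
    {"a": 3, "b": "y"},
    {"a": 4, "b": "w"},
]


def build_index(rows, column):
    """Map each value of `column` to the set of row positions holding it."""
    index = defaultdict(set)
    for pos, row in enumerate(rows):
        index[row[column]].add(pos)
    return index


def or_lookup(indexes, predicates):
    """Evaluate `col1 = v1 OR col2 = v2 ...` as a union of index lookups."""
    matched = set()
    for column, value in predicates:
        matched |= indexes[column].get(value, set())
    return sorted(matched)


indexes = {"a": build_index(rows, "a"), "b": build_index(rows, "b")}

# Equivalent to: WHERE a = 1 OR b = 'y'
positions = or_lookup(indexes, [("a", 1), ("b", "y")])
print(positions)  # row 4 is never touched
```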

Storage Engine

  1. Provides unified expression partitioning, supporting arbitrary multi-level partitioning, where each level can be any expression.
  2. Introduces a generic aggregate function state storage framework, which, in addition to the originally supported aggregate functions like SUM/MIN/MAX, can now conveniently support almost all other aggregate functions.
  3. Supports Vector Index with two index types, IVFPQ and HNSW, enabling fast approximate nearest neighbor (ANN) search over large-scale, high-dimensional vectors, as commonly required in deep learning and machine learning.
  4. In the Shared-nothing architecture, Backup/Restore now supports more object types, such as Logical View and External Catalog, and also supports tables that use expression partitioning or List partitioning.
  5. Optimizes log printing to avoid occupying too much disk space.
  6. Accurately displays the status of BE/CN during a graceful exit.

Loading

  1. Provides a Batch Commit feature, which consolidates multiple concurrent Stream Loads of a table into a single ingestion transaction, thus improving the throughput of real-time data ingestion.
  2. INSERT OVERWRITE now supports automatic creation of partitions based on imported data and only overwrites partitions containing data, simplifying partial data recovery.
  3. Some data ingestion improvements for INSERT from FILES:
    • INSERT now supports matching columns by name (the default is matching by position).
    • INSERT supports PROPERTIES to set some parameters, like strict_mode, max_filter_ratio, and timeout (strict_mode replaces enable_insert_strict, and differs slightly from it).
    • Enables pushing down the target table schema when using INSERT from FILES, so that a more accurate source data schema can be inferred.
    • FILES now provides the ability to list files specified by the path parameter.
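
The Batch Commit idea from item 1 of this section can be pictured with the sketch below. It is a conceptual model only, not StarRocks' implementation or API: `BatchCommitter`, `Transaction`, `submit`, and `flush` are hypothetical names, and a real system would flush on a time or size threshold and handle concurrency. It shows several loads for the same table being merged into one transaction.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Transaction:
    """One ingestion transaction covering several merged loads (hypothetical)."""
    table: str
    rows: list = field(default_factory=list)
    load_count: int = 0


class BatchCommitter:
    """Buffers loads per table and commits each table's loads as one transaction."""

    def __init__(self):
        self._pending = defaultdict(list)

    def submit(self, table, rows):
        # Each call stands in for one incoming load request.
        self._pending[table].append(list(rows))

    def flush(self):
        committed = []
        for table, loads in self._pending.items():
            txn = Transaction(table=table)
            for rows in loads:
                txn.rows.extend(rows)
                txn.load_count += 1
            committed.append(txn)
        self._pending.clear()
        return committed


committer = BatchCommitter()
committer.submit("events", [{"id": 1}])
committer.submit("events", [{"id": 2}, {"id": 3}])
committer.submit("users", [{"id": 10}])
for txn in committer.flush():
    print(txn.table, txn.load_count, len(txn.rows))
```

Three load requests result in two transactions, one per table, which is where the throughput gain for many small real-time loads would come from.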
@jaogoy jaogoy added type/enhancement Make an enhancement to StarRocks type/feature-request labels Nov 4, 2024