Release plan 3.4 #52573

jaogoy · 2024-11-04T03:00:21Z

ETA: December 2024

Shared-data Enhancements

Provides a Query cache, aligned with the Shared-nothing architecture.
Supports synchronous materialized views, aligned with the Shared-nothing architecture.

Data Lake Analytics

Optimizes Iceberg V2 queries by reducing repeated reads of delete files, which lowers memory usage and improves query performance.
Provides Time Travel query capability for Iceberg, allowing data to be read from a specified BRANCH or TAG by specifying TIMESTAMP or VERSION.
Supports Delta Lake column mapping.
Data Cache related improvements:
1. Introduces a Segmented LRU (SLRU) Cache eviction strategy, which significantly defends against cache pollution from occasional large queries, improves cache hit rate, and reduces fluctuations in query performance.
2. Unifies parameters for Data Cache in both Shared-data architecture and lake query scenarios.
3. Provides an Adaptive IO strategy optimization for Data Cache, which adaptively routes some query requests to remote storage based on the cache disk's load and performance, thereby enhancing overall access throughput.
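As an illustration of the Segmented LRU idea in item 1, here is a minimal Python sketch. It is not the StarRocks Data Cache implementation; the class name, the two-segment layout, and the capacities are assumptions chosen for the example. New keys land in a probationary segment, and only a second access promotes them to the protected segment, so a one-off large scan can evict only probationary entries.

```python
from collections import OrderedDict


class SLRUCache:
    """Illustrative Segmented LRU cache (hypothetical, not StarRocks code)."""

    def __init__(self, probation_cap, protected_cap):
        self.probation_cap = probation_cap
        self.protected_cap = protected_cap
        # Both segments are ordered from least to most recently used.
        self.probation = OrderedDict()
        self.protected = OrderedDict()

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:
            # A second access promotes the entry to the protected segment.
            value = self.probation.pop(key)
            self._promote(key, value)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            self.probation.pop(key)
            self._promote(key, value)
        else:
            self._insert_probation(key, value)

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_cap:
            # Demote the protected LRU entry instead of dropping it outright.
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_cap:
            # Only probationary entries are ever evicted from the cache.
            self.probation.popitem(last=False)
```

With this layout, entries that were read at least twice survive a burst of single-use keys, which is the cache-pollution protection the item describes.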
Enables asynchronous delivery of query fragments for lake queries. This removes the restriction that FE must obtain all files to be queried before BE can start executing, so FE can fetch the files to query while BE executes in parallel, shortening overall latency for lake queries that involve a large number of files. (Currently, the optimization is complete for Hudi and Delta Lake but not yet for Iceberg.)
Supports automatic collection of external table statistics, which can collect more accurate NDV information compared to metadata files, thereby optimizing the query plan and improving query performance.
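The asynchronous fragment delivery item above can be pictured as a producer/consumer pipeline. The following Python sketch is a conceptual model only, under the assumption that a "planner" thread stands in for FE listing files and an "executor" thread stands in for BE scanning them; the function and variable names are hypothetical and do not correspond to StarRocks internals.

```python
import queue
import threading


def run_pipelined(file_names):
    """Scan files as they are listed instead of waiting for the full list."""
    pending = queue.Queue()
    scanned = []

    def planner():
        # Stand-in for FE: hand over each file as soon as it is discovered.
        for name in file_names:
            pending.put(name)
        pending.put(None)  # Sentinel: no more files will arrive.

    def executor():
        # Stand-in for BE: start working before the listing has finished.
        while True:
            name = pending.get()
            if name is None:
                break
            scanned.append("scanned " + name)

    threads = [threading.Thread(target=planner), threading.Thread(target=executor)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return scanned
```

The point of the sketch is the overlap: the executor consumes the first file while the planner is still enumerating the rest, so total latency approaches the longer of the two phases rather than their sum.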

Performance Improvement and Query Optimization

Provides Arrow Flight interface for more efficient reading of large data volumes in query results.
[Experimental] Offers a preliminary query feedback feature for automatic optimization of slow queries. The system collects slow queries, analyzes each query's plan against its execution details to find potential optimizations, and may generate a tailored optimization guide. If the optimizer generates the same bad plan for subsequent identical queries, the system may locally adjust the plan in an attempt to produce a better one.
Enables the pushdown of multi-column OR predicates, allowing a SQL with multi-column OR conditions (e.g., a = xxx OR b = yyy) to utilize certain column indexes, thus reducing data read volume and improving query performance.
Further optimizes query performance for TPC-DS, with a roughly 30% improvement in TPC-DS 1 TB Iceberg queries.
Supports Python UDFs, offering more convenient function customization compared to Java UDFs.
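To make the multi-column OR pushdown item above more concrete, here is a small Python sketch of the general idea: each side of the OR is answered from its own column index, and the matching row sets are unioned, so only those rows need to be read. This is a simplified model with hypothetical helper names, not how the StarRocks storage engine implements it.

```python
def build_index(rows, column):
    """Map each value of `column` to the set of row ids that contain it."""
    index = {}
    for row_id, row in enumerate(rows):
        index.setdefault(row[column], set()).add(row_id)
    return index


def or_lookup(indexes, predicates):
    """Evaluate `col1 = v1 OR col2 = v2 ...` using per-column indexes.

    `predicates` is a list of (column, value) pairs; the result is the
    sorted list of row ids that satisfy at least one of them.
    """
    row_ids = set()
    for column, value in predicates:
        row_ids |= indexes[column].get(value, set())
    return sorted(row_ids)
```

For example, with indexes on columns `a` and `b`, the condition `a = 2 OR b = 'x'` touches only the rows returned by the two index lookups instead of scanning the whole table.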

Storage engine

Provides unified expression partitioning, supporting arbitrary multi-level partitioning, where each level can be any expression.
Introduces a generic aggregate function state storage framework, which, in addition to the originally supported aggregate functions like SUM/MIN/MAX, can now conveniently support almost all other aggregate functions.
Supports Vector Index, offering two types of indexes: IVFPQ and HNSW, enabling fast approximate nearest neighbor searches (ANN) in large-scale, high-dimensional vectors, commonly required in deep learning and machine learning.
In the Shared-nothing architecture, Backup/Restore now supports backing up more objects like Logical View, External Catalog, and also supports expression partitioning/List partitioning.
Optimizes log printing to avoid occupying too much disk space.
Accurately displays the status of BE/CN during a graceful exit.

Loading

Provides a Batch Commit feature, which consolidates multiple concurrent Stream Loads of a table into a single ingestion transaction, thus improving the throughput of real-time data ingestion.
INSERT OVERWRITE now supports automatic creation of partitions based on imported data and only overwrites partitions containing data, simplifying partial data recovery.
Some data ingestion improvements for INSERT from FILES:
- INSERT now supports matching columns by name (the default is matching by position).
- INSERT supports PROPERTIES to set some parameters, like strict_mode, max_filter_ratio, and timeout (strict_mode replaces enable_insert_strict, and differs slightly from it).
- Enables pushing down the target table schema when using INSERT from FILES, so that a more accurate source data schema can be inferred.
- FILES now provides the ability to list files specified by the path parameter.
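The Batch Commit feature listed at the top of this section can be modeled roughly as follows. This Python sketch assumes a simple "collect, then flush" design with hypothetical names (`BatchCommitter`, `submit`, `flush`); it only illustrates how several load requests for the same table can end up in one transaction and says nothing about the actual StarRocks implementation or its triggering policy.

```python
from collections import defaultdict


class BatchCommitter:
    """Toy model: merge pending loads per table into a single transaction."""

    def __init__(self):
        self.pending = defaultdict(list)  # table name -> list of row batches
        self.committed = []               # list of (txn_id, table, rows)
        self.next_txn_id = 1

    def submit(self, table, rows):
        # A load request is queued rather than committed on its own.
        self.pending[table].append(list(rows))

    def flush(self):
        # One transaction per table, covering every batch queued so far.
        for table, batches in self.pending.items():
            merged = [row for batch in batches for row in batch]
            self.committed.append((self.next_txn_id, table, merged))
            self.next_txn_id += 1
        self.pending.clear()
```

Three small loads into the same table therefore cost one commit instead of three, which is where the throughput gain for real-time ingestion would come from.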


jaogoy added type/enhancement Make an enhancement to StarRocks type/feature-request labels Nov 4, 2024

jaogoy mentioned this issue Nov 4, 2024

StarRocks Roadmap 2024 #39686

Open

61 tasks
