diff --git a/CHANGELOG.md b/CHANGELOG.md
index ed77f4bad26e..26474afd69c4 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,7 +20,7 @@
 Changelogs are maintained separately for each subproject. Please check out the changelog file within each subproject folder for more details:
-* [Datafusion CHANGELOG](./datafusion/CHANGELOG.md)
+* [DataFusion CHANGELOG](./datafusion/CHANGELOG.md)
 * [Ballista CHANGELOG](./ballista/CHANGELOG.md)
 For older versions, see [apache/arrow/CHANGELOG.md](https://github.com/apache/arrow/blob/master/CHANGELOG.md).
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index a2451d1f2c3a..e0cad120419b 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -132,7 +132,7 @@ python -m pytest -v integration-tests/test_psql_parity.py
 ### Criterion Benchmarks
-[Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by Datafusion for evaluating the performance of specific code-paths. In particular, the criterion benchmarks help to both guide optimisation efforts, and prevent performance regressions within Datafusion.
+[Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code-paths. In particular, the criterion benchmarks help to both guide optimisation efforts, and prevent performance regressions within DataFusion.
 Criterion integrates with Cargo's built-in [benchmark support](https://doc.rust-lang.org/cargo/commands/cargo-bench.html) and a given benchmark can be run with
@@ -160,7 +160,7 @@ The benchmark will automatically remove any generated parquet file on exit, howe
 ### Upstream Benchmark Suites
-Instructions and tooling for running upstream benchmark suites against Datafusion and/or Ballista can be found in [benchmarks](./benchmarks).
+Instructions and tooling for running upstream benchmark suites against DataFusion and/or Ballista can be found in [benchmarks](./benchmarks).
 These are valuable for comparative evaluation against alternative Arrow implementations and query engines.
@@ -227,7 +227,7 @@ dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
 ## Specification
-We formalize Datafusion semantics and behaviors through specification
+We formalize DataFusion semantics and behaviors through specification
 documents. These specifications are useful to be used as references to help resolve ambiguities during development or code reviews.
diff --git a/ballista/rust/core/src/execution_plans/mod.rs b/ballista/rust/core/src/execution_plans/mod.rs
index b10ff341e903..7a5e105c6c4a 100644
--- a/ballista/rust/core/src/execution_plans/mod.rs
+++ b/ballista/rust/core/src/execution_plans/mod.rs
@@ -15,7 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.
-//! This module contains execution plans that are needed to distribute Datafusion's execution plans into
+//! This module contains execution plans that are needed to distribute DataFusion's execution plans into
 //! several Ballista executors.
 mod distributed_query;
diff --git a/conbench/benchmarks.py b/conbench/benchmarks.py
index 9ad3e314ee4e..f80b3add90f9 100644
--- a/conbench/benchmarks.py
+++ b/conbench/benchmarks.py
@@ -38,4 +38,4 @@ def _f(self):
 @conbench.runner.register_benchmark
 class CargoBenchmarks(_criterion.CriterionBenchmark):
 name = "datafusion"
- description = "Run Arrow Datafusion micro benchmarks."
+ description = "Run Arrow DataFusion micro benchmarks."
diff --git a/datafusion-physical-expr/src/expressions/cast.rs b/datafusion-physical-expr/src/expressions/cast.rs
index 2870c9b2654d..9144acc405e3 100644
--- a/datafusion-physical-expr/src/expressions/cast.rs
+++ b/datafusion-physical-expr/src/expressions/cast.rs
@@ -30,7 +30,7 @@ use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
-/// provide Datafusion default cast options
+/// provide DataFusion default cast options
 pub const DEFAULT_DATAFUSION_CAST_OPTIONS: CastOptions = CastOptions { safe: false };
 /// CAST expression casts an expression to a specific data type and returns a runtime error on invalid cast
diff --git a/datafusion/CHANGELOG.md b/datafusion/CHANGELOG.md
index bba7c28dcb81..ad3d09c6a839 100644
--- a/datafusion/CHANGELOG.md
+++ b/datafusion/CHANGELOG.md
@@ -32,7 +32,7 @@
 - Remove non idiomatic `DataFusionError::into_arrow_external_error` in favor of From conversion [\#1645](https://github.com/apache/arrow-datafusion/pull/1645) ([alamb](https://github.com/alamb))
 - Remove `Accumulator::update` and `Accumulator::merge` [\#1582](https://github.com/apache/arrow-datafusion/pull/1582) ([Jimexist](https://github.com/Jimexist))
 - implement `Hash` for various types and replace `PartialOrd` [\#1580](https://github.com/apache/arrow-datafusion/pull/1580) ([Jimexist](https://github.com/Jimexist))
-- Replace `DatafusionError` with `GenericError` in `ObjectStore` interface [\#1541](https://github.com/apache/arrow-datafusion/pull/1541) ([matthewmturner](https://github.com/matthewmturner))
+- Replace `DataFusionError` with `GenericError` in `ObjectStore` interface [\#1541](https://github.com/apache/arrow-datafusion/pull/1541) ([matthewmturner](https://github.com/matthewmturner))
 - Make `FLOAT` SQL type map to `Float32` rather than `Float64` [\#1423](https://github.com/apache/arrow-datafusion/pull/1423) [[sql](https://github.com/apache/arrow-datafusion/labels/sql)] ([liukun4515](https://github.com/liukun4515))
 - Map `REAL` SQL type to `Float32` rather than `Float64` to be consistent with pg [\#1390](https://github.com/apache/arrow-datafusion/pull/1390) [[sql](https://github.com/apache/arrow-datafusion/labels/sql)] ([hntd187](https://github.com/hntd187))
@@ -79,7 +79,7 @@
 - Add support for `ORDER BY` on unprojected columns [\#1415](https://github.com/apache/arrow-datafusion/pull/1415) ([viirya](https://github.com/viirya))
 - Support decimal for `min` and `max` aggregate [\#1407](https://github.com/apache/arrow-datafusion/pull/1407) ([liukun4515](https://github.com/liukun4515))
 - Consolidate `ConstantFolding` and `SimplifyExpression` [\#1375](https://github.com/apache/arrow-datafusion/pull/1375) ([alamb](https://github.com/alamb))
-- Datafusion cli quiet mode command to contain option bool [\#1345](https://github.com/apache/arrow-datafusion/pull/1345) ([Jimexist](https://github.com/Jimexist))
+- DataFusion cli quiet mode command to contain option bool [\#1345](https://github.com/apache/arrow-datafusion/pull/1345) ([Jimexist](https://github.com/Jimexist))
 - Implement `array_agg` aggregate function [\#1300](https://github.com/apache/arrow-datafusion/pull/1300) ([viirya](https://github.com/viirya))
 - Add a command to switch output format in cli [\#1284](https://github.com/apache/arrow-datafusion/pull/1284) ([capkurmagati](https://github.com/capkurmagati))
 - Support `=`, `<`, `<=`, `>`, `>=`, `!=`, `is distinct from`, `is not distinct from` for `BooleanArray` [\#1163](https://github.com/apache/arrow-datafusion/pull/1163) ([alamb](https://github.com/alamb))
@@ -94,7 +94,7 @@
 - CTE/WITH .. UNION ALL confuses name resolution in WHERE [\#1509](https://github.com/apache/arrow-datafusion/issues/1509)
 - ORDER BY min\(x\) results in error `Plan("No field named 'foo.x'. Valid fields are 'MIN(foo.x)'.")` [\#1479](https://github.com/apache/arrow-datafusion/issues/1479)
 - Sort discards field metadata on the output schema [\#1476](https://github.com/apache/arrow-datafusion/issues/1476)
-- Datafusion should not strip out timezone information from existing types [\#1454](https://github.com/apache/arrow-datafusion/issues/1454)
+- DataFusion should not strip out timezone information from existing types [\#1454](https://github.com/apache/arrow-datafusion/issues/1454)
 - Error on some queries: "column types must match schema types, expected XXX but found YYY" [\#1447](https://github.com/apache/arrow-datafusion/issues/1447)
 - Query failing to return any results when filter is an equality check on strings \(bad statistics in parquet\) [\#1433](https://github.com/apache/arrow-datafusion/issues/1433)
 - Field names containing period such as `f.c1` cannot be named in SQL query [\#1432](https://github.com/apache/arrow-datafusion/issues/1432)
@@ -111,7 +111,7 @@
 - Fix single\_distinct\_to\_groupby for arbitrary expressions [\#1519](https://github.com/apache/arrow-datafusion/pull/1519) ([james727](https://github.com/james727))
 - Fix SortExec discards field metadata on the output schema [\#1477](https://github.com/apache/arrow-datafusion/pull/1477) ([alamb](https://github.com/alamb))
 - fix calculate in many\_to\_many\_hash\_partition test. [\#1463](https://github.com/apache/arrow-datafusion/pull/1463) ([Ted-Jiang](https://github.com/Ted-Jiang))
-- Add Timezone to Scalar::Time\* types, and better timezone awareness to Datafusion's time types [\#1455](https://github.com/apache/arrow-datafusion/pull/1455) ([maxburke](https://github.com/maxburke))
+- Add Timezone to Scalar::Time\* types, and better timezone awareness to DataFusion's time types [\#1455](https://github.com/apache/arrow-datafusion/pull/1455) ([maxburke](https://github.com/maxburke))
 - Support identifiers with `.` in them [\#1449](https://github.com/apache/arrow-datafusion/pull/1449) [[sql](https://github.com/apache/arrow-datafusion/labels/sql)] ([alamb](https://github.com/alamb))
 - Fixes for working with functions in dataframes, additional documentation [\#1430](https://github.com/apache/arrow-datafusion/pull/1430) ([tobyhede](https://github.com/tobyhede))
 - \[Minor\] Fix `send_time` metric for hash-repartition [\#1421](https://github.com/apache/arrow-datafusion/pull/1421) ([Dandandan](https://github.com/Dandandan))
@@ -130,7 +130,7 @@
 - Clarify docs about `Accumulator::update` and `Accumulator::update_batch` [\#1542](https://github.com/apache/arrow-datafusion/pull/1542) ([alamb](https://github.com/alamb))
 - Fix duplicated `cargo run --example parquet_sql` [\#1482](https://github.com/apache/arrow-datafusion/pull/1482) ([sergey-melnychuk](https://github.com/sergey-melnychuk))
-- add documentation to Datafusion cli's new commands [\#1348](https://github.com/apache/arrow-datafusion/pull/1348) ([liukun4515](https://github.com/liukun4515))
+- add documentation to DataFusion cli's new commands [\#1348](https://github.com/apache/arrow-datafusion/pull/1348) ([liukun4515](https://github.com/liukun4515))
 - fix some clippy warnings from nightly channel [\#1277](https://github.com/apache/arrow-datafusion/pull/1277) [[sql](https://github.com/apache/arrow-datafusion/labels/sql)] ([Jimexist](https://github.com/Jimexist))
 **Performance improvements:**
@@ -470,7 +470,7 @@
 - delete redundant code [\#973](https://github.com/apache/arrow-datafusion/issues/973)
 - How to build DataFusion python wheel [\#853](https://github.com/apache/arrow-datafusion/issues/853)
 - Add support for partition pruning [\#204](https://github.com/apache/arrow-datafusion/issues/204)
-- \[Datafusion\] Support joins on TimestampMillisecond columns [\#187](https://github.com/apache/arrow-datafusion/issues/187)
+- \[DataFusion\] Support joins on TimestampMillisecond columns [\#187](https://github.com/apache/arrow-datafusion/issues/187)
 - TPC-H Query 21 [\#173](https://github.com/apache/arrow-datafusion/issues/173)
 - TPC-H Query 13 [\#164](https://github.com/apache/arrow-datafusion/issues/164)
 - TPC-H Query 8 [\#162](https://github.com/apache/arrow-datafusion/issues/162)
@@ -509,7 +509,7 @@ For older versions, see [apache/arrow/CHANGELOG.md](https://github.com/apache/ar
 - Box ScalarValue:Lists, reduce size by half size [\#788](https://github.com/apache/arrow-datafusion/pull/788) ([alamb](https://github.com/alamb))
 - JOIN conditions are order dependent [\#778](https://github.com/apache/arrow-datafusion/pull/778) ([seddonm1](https://github.com/seddonm1))
 - Show the result of all optimizer passes in EXPLAIN VERBOSE [\#759](https://github.com/apache/arrow-datafusion/pull/759) ([alamb](https://github.com/alamb))
-- \#723 Datafusion add option in ExecutionConfig to enable/disable parquet pruning [\#749](https://github.com/apache/arrow-datafusion/pull/749) ([lvheyang](https://github.com/lvheyang))
+- \#723 DataFusion add option in ExecutionConfig to enable/disable parquet pruning [\#749](https://github.com/apache/arrow-datafusion/pull/749) ([lvheyang](https://github.com/lvheyang))
 - Update API for extension planning to include logical plan [\#643](https://github.com/apache/arrow-datafusion/pull/643) ([alamb](https://github.com/alamb))
 - Rename MergeExec to CoalescePartitionsExec [\#635](https://github.com/apache/arrow-datafusion/pull/635) ([andygrove](https://github.com/andygrove))
 - fix 593, reduce cloning by taking ownership in logical planner's `from` fn [\#610](https://github.com/apache/arrow-datafusion/pull/610) ([Jimexist](https://github.com/Jimexist))
@@ -520,7 +520,7 @@ For older versions, see [apache/arrow/CHANGELOG.md](https://github.com/apache/ar
 - Use 4.x arrow-rs from crates.io rather than git sha [\#395](https://github.com/apache/arrow-datafusion/pull/395) ([alamb](https://github.com/alamb))
 - Return Vec\ from PredicateBuilder rather than an `Fn` [\#370](https://github.com/apache/arrow-datafusion/pull/370) ([alamb](https://github.com/alamb))
 - Refactor: move RowGroupPredicateBuilder into its own module, rename to PruningPredicateBuilder [\#365](https://github.com/apache/arrow-datafusion/pull/365) ([alamb](https://github.com/alamb))
-- \[Datafusion\] NOW\(\) function support [\#288](https://github.com/apache/arrow-datafusion/pull/288) ([msathis](https://github.com/msathis))
+- \[DataFusion\] NOW\(\) function support [\#288](https://github.com/apache/arrow-datafusion/pull/288) ([msathis](https://github.com/msathis))
 - Implement select distinct [\#262](https://github.com/apache/arrow-datafusion/pull/262) ([Dandandan](https://github.com/Dandandan))
 - Refactor datafusion/src/physical\_plan/common.rs build\_file\_list to take less param and reuse code [\#253](https://github.com/apache/arrow-datafusion/pull/253) ([Jimexist](https://github.com/Jimexist))
 - Support qualified columns in queries [\#55](https://github.com/apache/arrow-datafusion/pull/55) ([houqp](https://github.com/houqp))
@@ -718,7 +718,7 @@ For older versions, see [apache/arrow/CHANGELOG.md](https://github.com/apache/ar
 - RFC Roadmap for 2021 \(DataFusion\) [\#140](https://github.com/apache/arrow-datafusion/issues/140)
 - Implement hash partitioning [\#131](https://github.com/apache/arrow-datafusion/issues/131)
 - Grouping by column position [\#110](https://github.com/apache/arrow-datafusion/issues/110)
-- \[Datafusion\] GROUP BY with a high cardinality doesn't seem to finish [\#107](https://github.com/apache/arrow-datafusion/issues/107)
+- \[DataFusion\] GROUP BY with a high cardinality doesn't seem to finish [\#107](https://github.com/apache/arrow-datafusion/issues/107)
 - \[Rust\] Add support for JSON data sources [\#103](https://github.com/apache/arrow-datafusion/issues/103)
 - \[Rust\] Implement metrics framework [\#95](https://github.com/apache/arrow-datafusion/issues/95)
 - Publically export Arrow crate from datafusion [\#36](https://github.com/apache/arrow-datafusion/issues/36)
diff --git a/datafusion/src/execution/context.rs b/datafusion/src/execution/context.rs
index 81f6b6cceb03..df795db1522c 100644
--- a/datafusion/src/execution/context.rs
+++ b/datafusion/src/execution/context.rs
@@ -833,7 +833,7 @@ pub struct ExecutionConfig {
 /// Should DataFusion repartition data using the partition keys to execute window functions in
 /// parallel using the provided `target_partitions` level
 pub repartition_windows: bool,
- /// Should Datafusion parquet reader using the predicate to prune data
+ /// Should DataFusion parquet reader using the predicate to prune data
 parquet_pruning: bool,
 /// Runtime configurations such as memory threshold and local disk for spill
 pub runtime: RuntimeConfig,
diff --git a/datafusion/src/physical_plan/planner.rs b/datafusion/src/physical_plan/planner.rs
index 4055b1488422..0a76802a417b 100644
--- a/datafusion/src/physical_plan/planner.rs
+++ b/datafusion/src/physical_plan/planner.rs
@@ -583,7 +583,7 @@ impl DefaultPhysicalPlanner {
 // columns with names like `SUM(t1.c1)`, `t1.c1 + t1.c2`, etc.
 //
 // If we run these logical columns through physical_name function, we will
- // get physical names with column qualifiers, which violates Datafusion's
+ // get physical names with column qualifiers, which violates DataFusion's
 // field name semantics. To account for this, we need to derive the
 // physical name from physical input instead.
 //
diff --git a/dev/release/README.md b/dev/release/README.md
index 467f9923f9a4..a8886fc46db1 100644
--- a/dev/release/README.md
+++ b/dev/release/README.md
@@ -21,18 +21,18 @@
 ## Sub-projects
-The Datafusion repo contains 2 different releasable sub-projects: Datafusion, Ballista
+The DataFusion repo contains 2 different releasable sub-projects: DataFusion, Ballista
-We use Datafusion release to drive the release for the other sub-projects. As a
-result, Datafusion version bump is required for every release while version
+We use DataFusion release to drive the release for the other sub-projects. As a
+result, DataFusion version bump is required for every release while version
 bumps for the Python binding and Ballista are optional. In other words, we can
-release a new version of Datafusion without releasing a new version of the
+release a new version of DataFusion without releasing a new version of the
 Python binding or Ballista. On the other hand, releasing a new version of the
-Python binding or Ballista always requires a new Datafusion version release.
+Python binding or Ballista always requires a new DataFusion version release.
 ## Branching
-Datafusion currently only releases from the `master` branch. Given the project
+DataFusion currently only releases from the `master` branch. Given the project
 is still in early development state, we are not maintaining an active stable release backport branch.
@@ -177,11 +177,11 @@ Send the email output from the script to dev@arrow.apache.org. The email should
 ```
 To: dev@arrow.apache.org
-Subject: [VOTE][Datafusion] Release Apache Arrow Datafusion 5.1.0 RC0
+Subject: [VOTE][DataFusion] Release Apache Arrow DataFusion 5.1.0 RC0
 Hi,
-I would like to propose a release of Apache Arrow Datafusion Implementation,
+I would like to propose a release of Apache Arrow DataFusion Implementation,
 version 5.1.0.
 This release candidate is based on commit: a5dd428f57e62db20a945e8b1895de91405958c4 [1]
@@ -193,9 +193,9 @@ and vote on the release.
 The vote will be open for at least 72 hours.
-[ ] +1 Release this as Apache Arrow Datafusion 5.1.0
+[ ] +1 Release this as Apache Arrow DataFusion 5.1.0
 [ ] +0
-[ ] -1 Do not release this as Apache Arrow Datafusion 5.1.0 because...
+[ ] -1 Do not release this as Apache Arrow DataFusion 5.1.0 because...
 [1]: https://github.com/apache/arrow-datafusion/tree/a5dd428f57e62db20a945e8b1895de91405958c4
 [2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-5.1.0
diff --git a/dev/release/create-tarball.sh b/dev/release/create-tarball.sh
index c668ea8708a5..495ab374fcaf 100755
--- a/dev/release/create-tarball.sh
+++ b/dev/release/create-tarball.sh
@@ -80,10 +80,10 @@ echo ""
 echo "---------------------------------------------------------"
 cat < + DataFusion Ballista .. _toc.community:
diff --git a/docs/source/specification/output-field-name-semantic.md b/docs/source/specification/output-field-name-semantic.md
index bc0813abd06b..c866573447b0 100644
--- a/docs/source/specification/output-field-name-semantic.md
+++ b/docs/source/specification/output-field-name-semantic.md
@@ -17,11 +17,11 @@
 under the License.
 -->
-# Datafusion output field name semantic
+# DataFusion output field name semantic
 This specification documents how field names in output record batches should be generated based on given user queries. The filed name rules apply to
-Datafusion queries planned from both SQL queries and Dataframe APIs.
+DataFusion queries planned from both SQL queries and Dataframe APIs.
 ## Field name rules
@@ -66,7 +66,7 @@ FROM t1
 JOIN t2 ON t1.id = t2.id
 ```
-Datafusion Arrow record batches output:
+DataFusion Arrow record batches output:
 | id | a | id | b |
 | --- | --- | --- | ----- |
@@ -95,7 +95,7 @@ Query:
 SELECT ABS(t1.id), abs(-id) FROM t1;
 ```
-Datafusion Arrow record batches output:
+DataFusion Arrow record batches output:
 | abs(t1.id) | abs((- t1.id)) |
 | ---------- | -------------- |
@@ -138,7 +138,7 @@ Query:
 SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
 ```
-Datafusion Arrow record batches output:
+DataFusion Arrow record batches output:
 | t1.id + abs(t1.id) | abs(t1.id \* t1.id) |
 | ------------------ | ------------------- |
@@ -181,7 +181,7 @@ Query:
 SELECT 1, 2+5, 'foo_bar';
 ```
-Datafusion Arrow record batches output:
+DataFusion Arrow record batches output:
 | 1 | (2 + 5) | foo_bar |
 | --- | ------- | ------- |
diff --git a/docs/source/specification/quarterly_roadmap.md b/docs/source/specification/quarterly_roadmap.md
index 5bb805d7e7f0..d193952767af 100644
--- a/docs/source/specification/quarterly_roadmap.md
+++ b/docs/source/specification/quarterly_roadmap.md
@@ -42,7 +42,7 @@ A quarterly roadmap will be published to give the DataFusion community visibilit
 ### New Features
 - Read JSON as table
-- Simplify DDL with Datafusion-Cli
+- Simplify DDL with DataFusion-Cli
 - Add Decimal128 data type and the attendant features such as Arrow Kernel and UDF support
 - Add new experimental e-graph based optimizer
diff --git a/docs/source/specification/rfcs/template.md b/docs/source/specification/rfcs/template.md
index 98704fd46fe9..a6f79fe939ce 100644
--- a/docs/source/specification/rfcs/template.md
+++ b/docs/source/specification/rfcs/template.md
@@ -27,7 +27,7 @@ Authors:
 RFC PR: #
-Datafusion Issue: #
+DataFusion Issue: #
 ---
diff --git a/docs/source/user-guide/sql/datafusion-functions.md b/docs/source/user-guide/sql/datafusion-functions.md
index 8431baf2a3b1..aa8001270e95 100644
--- a/docs/source/user-guide/sql/datafusion-functions.md
+++ b/docs/source/user-guide/sql/datafusion-functions.md
@@ -17,7 +17,7 @@
 under the License.
 -->
-# Datafusion-Specific Functions
+# DataFusion-Specific Functions
 These SQL functions are specific to DataFusion, or they are well known and have functionality which is specific to DataFusion. Specifically, the `to_timestamp_xx()` functions exist due to Arrow's support for multiple timestamp resolutions.