Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support: Date +/plus Int or date_add function #6876

Open
liukun4515 opened this issue Jul 7, 2023 · 8 comments
Open

support: Date +/plus Int or date_add function #6876

liukun4515 opened this issue Jul 7, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@liukun4515
Copy link
Contributor

Is your feature request related to a problem or challenge?

we have requirements related to the date operation, such as date_add, date_sub.

In the spark and mysql, the date_add(start_date, value_expr) function has been supported.
But in the datafusion, we only use the expr + interval 'value' day to support them.

In the mysql or the spark, the second args can be the expr, but the datafusion has no method to get the same target.

Describe the solution you'd like

In the PG document, PG support date +/plus integer.

We can try to support this feature in the datafusion or in the kernel of arrow-rs.

Describe alternatives you've considered

No response

Additional context

cc @waitingkuo @alamb

@liukun4515 liukun4515 added the enhancement New feature or request label Jul 7, 2023
@alamb
Copy link
Contributor

alamb commented Jul 7, 2023

👍 -- I believe @tustvold is cleaning up the arithmetic logic in arrow-rs / datafusion now

What types can the value_expr be in spark?

@watfordkcf
Copy link

watfordkcf commented Jul 10, 2023

It takes integers:

spark-sql (default)> SELECT date_add(CAST(now() AS DATE), 1.5);
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "date_add(CAST(now() AS DATE), 1.5)" due to data type mismatch:
    Parameter 2 requires the ("INT" or "SMALLINT" or "TINYINT") type, however "1.5" has the type "DECIMAL(2,1)".; line 1 pos 7;
'Project [unresolvedalias(date_add(cast(now() as date), 1.5), None)]
+- OneRowRelation

@alamb
Copy link
Contributor

alamb commented Jul 10, 2023

https://spark.apache.org/docs/2.3.0/api/sql/#date_add

It appears that date_add should take the number of days as the second argument

To model this in DataFusion you could probably do a rewrite like this

date_add(x, y)
cast(x as date) + cast(y as IntervalMonthDayNano)

You could also add a new built in function (which would probably make the type casting logic more explicit)

@liukun4515
Copy link
Contributor Author

liukun4515 commented Jul 11, 2023

👍 -- I believe @tustvold is cleaning up the arithmetic logic in arrow-rs / datafusion now

Ok, I will take look this work and track this process of work

What types can the value_expr be in spark?

In the spark

spark-sql> select version();
3.2.0 5d45a415f3a29898d92380380cfd82bfc7f579ea
Time taken: 0.084 seconds, Fetched 1 row(s)

spark-sql> desc test;
a                   	date
b                   	int

date + integer constant

spark-sql> explain extended select a+10 from test;
== Parsed Logical Plan ==
'Project [unresolvedalias(('a + 10), None)]
+- 'UnresolvedRelation [test], [], false

== Analyzed Logical Plan ==
date_add(a, 10): date
Project [date_add(a#49, 10) AS date_add(a, 10)#51]
+- SubqueryAlias spark_catalog.default.test
   +- HiveTableRelation [`default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [a#49, b#50], Partition Cols: []]

== Optimized Logical Plan ==
Project [date_add(a#49, 10) AS date_add(a, 10)#51]
+- HiveTableRelation [`default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [a#49, b#50], Partition Cols: []]

== Physical Plan ==
*(1) Project [date_add(a#49, 10) AS date_add(a, 10)#51]
+- Scan hive default.test [a#49], HiveTableRelation [`default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [a#49, b#50], Partition Cols: []]

Time taken: 0.04 seconds, Fetched 1 row(s)

There is a specific analyse rule to handle the data/time with the operation of arithmetic, the rule convert the date + integer to the build-in function date_add.

date +/date_add integer column/expr

spark-sql> explain extended select a+b from test;
== Parsed Logical Plan ==
'Project [unresolvedalias(('a + 'b), None)]
+- 'UnresolvedRelation [test], [], false

== Analyzed Logical Plan ==
date_add(a, b): date
Project [date_add(a#88, b#89) AS date_add(a, b)#90]
+- SubqueryAlias spark_catalog.default.test
   +- HiveTableRelation [`default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [a#88, b#89], Partition Cols: []]

== Optimized Logical Plan ==
Project [date_add(a#88, b#89) AS date_add(a, b)#90]
+- HiveTableRelation [`default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [a#88, b#89], Partition Cols: []]

== Physical Plan ==
*(1) Project [date_add(a#88, b#89) AS date_add(a, b)#90]
+- Scan hive default.test [a#88, b#89], HiveTableRelation [`default`.`test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [a#88, b#89], Partition Cols: []]

Time taken: 0.035 seconds, Fetched 1 row(s)

because the PG support the operation date +/- integer described in the doc https://www.postgresql.org/docs/current/functions-datetime.html
For example

date + integer → date

Add a number of days to a date

date '2001-09-28' + 7 → 2001-10-05

So I want to support the more arithmetic operation for date/time/timestamp/interval in the datafusion(maybe we can implement them in the arrow-rs).

The date operated by the arithmetic operation is required in the sql system or the query engine, So i don't know if the implementation of above operation in the arrow-rs kernel is suitable?

@tustvold
Copy link
Contributor

Upstream will, in the next release, support date subtraction returning a duration in addition to adding durations or intervals to dates. I think the semantics of adding plain integers to dates is suitably opaque that we probably won't support that upstream, but the approach of using a rewrite to convert to a duration makes a lot of sense

@watfordkcf
Copy link

Postgres has a function make_interval:

make_interval ( [ years int [, months int [, weeks int [, days int [, hours int [, mins int [, secs double precision ]]]]]]] ) → interval

Create interval from years, months, weeks, days, hours, minutes and seconds fields, each of which can default to zero

make_interval(days => 10) → 10 days

Would this help if ported @liukun4515 ?

@liukun4515
Copy link
Contributor Author

Upstream will, in the next release, support date subtraction returning a duration in addition to adding durations or intervals to dates.

What‘s the upstream? arrow-rs?

I think the semantics of adding plain integers to dates is suitably opaque that we probably won't support that upstream, but the approach of using a rewrite to convert to a duration makes a lot of sense

I also think the semantics of adding plain integers to dates is not suitable for adding in the arrow-rs.

The pg https://www.postgresql.org/docs/current/functions-datetime.html support the operation

date + integer → date

Add a number of days to a date

date '2001-09-28' + 7 → 2001-10-05

We can use the rewrite to support them.

Thanks @tustvold

@liukun4515
Copy link
Contributor Author

liukun4515 commented Jul 13, 2023

Would this help if ported

Yes, i have a plan to implement this function from below discussion #3148 (comment)

and create issue to track this #6951

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants