
Commit 6539441

revans2 and ttnghia authored
Enable JSON Scan and from_json by default (#11753)
Signed-off-by: Robert (Bobby) Evans <[email protected]> Co-authored-by: Nghia Truong <[email protected]>
1 parent 6cba00d commit 6539441


53 files changed: +151 −176 lines

docs/additional-functionality/advanced_configs.md (+3 −3)
@@ -95,8 +95,8 @@ Name | Description | Default Value | Applicable at
 <a name="sql.format.hive.text.write.enabled"></a>spark.rapids.sql.format.hive.text.write.enabled|When set to false disables Hive text table write acceleration|false|Runtime
 <a name="sql.format.iceberg.enabled"></a>spark.rapids.sql.format.iceberg.enabled|When set to false disables all Iceberg acceleration|true|Runtime
 <a name="sql.format.iceberg.read.enabled"></a>spark.rapids.sql.format.iceberg.read.enabled|When set to false disables Iceberg input acceleration|true|Runtime
-<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|false|Runtime
-<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|false|Runtime
+<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|true|Runtime
+<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|true|Runtime
 <a name="sql.format.orc.enabled"></a>spark.rapids.sql.format.orc.enabled|When set to false disables all orc input and output acceleration|true|Runtime
 <a name="sql.format.orc.floatTypesToString.enable"></a>spark.rapids.sql.format.orc.floatTypesToString.enable|When reading an ORC file, the source data schemas(schemas of ORC file) may differ from the target schemas (schemas of the reader), we need to handle the castings from source type to target type. Since float/double numbers in GPU have different precision with CPU, when casting float/double to string, the result of GPU is different from result of CPU spark. Its default value is `true` (this means the strings result will differ from result of CPU). If it's set `false` explicitly and there exists casting from float/double to string in the job, then such behavior will cause an exception, and the job will fail.|true|Runtime
 <a name="sql.format.orc.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type.|2147483647|Runtime
@@ -278,7 +278,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
 <a name="sql.expression.IsNaN"></a>spark.rapids.sql.expression.IsNaN|`isnan`|Checks if a value is NaN|true|None|
 <a name="sql.expression.IsNotNull"></a>spark.rapids.sql.expression.IsNotNull|`isnotnull`|Checks if a value is not null|true|None|
 <a name="sql.expression.IsNull"></a>spark.rapids.sql.expression.IsNull|`isnull`|Checks if a value is null|true|None|
-<a name="sql.expression.JsonToStructs"></a>spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|false|This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case|
+<a name="sql.expression.JsonToStructs"></a>spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|true|None|
 <a name="sql.expression.JsonTuple"></a>spark.rapids.sql.expression.JsonTuple|`json_tuple`|Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.|false|This is disabled by default because Experimental feature that could be unstable or have performance issues.|
 <a name="sql.expression.KnownFloatingPointNormalized"></a>spark.rapids.sql.expression.KnownFloatingPointNormalized| |Tag to prevent redundant normalization|true|None|
 <a name="sql.expression.KnownNotNull"></a>spark.rapids.sql.expression.KnownNotNull| |Tag an expression as known to not be null|true|None|

docs/compatibility.md (+69 −92)
@@ -316,133 +316,110 @@ case.
 
 ## JSON
 
-The JSON format read is an experimental feature which is expected to have some issues, so we disable
-it by default. If you would like to test it, you need to enable `spark.rapids.sql.format.json.enabled` and
-`spark.rapids.sql.format.json.read.enabled`.
+JSON, despite being a standard format, has some ambiguity in it. Spark also offers the ability to allow
+some invalid JSON to be parsed. We have tried to provide JSON parsing that is compatible with
+what Apache Spark supports. Note that Spark itself has changed through different releases, and we will
+try to call out the releases for which we produce different results. JSON parsing is enabled by default
+except for date and timestamp types, where we still have work to complete. If you wish to disable
+JSON Scan you can set `spark.rapids.sql.format.json.enabled` or
+`spark.rapids.sql.format.json.read.enabled` to false. To disable `from_json` you can set
+`spark.rapids.sql.expression.JsonToStructs` to false.
 
-### Invalid JSON
+### Limits
 
-In Apache Spark on the CPU if a line in the JSON file is invalid the entire row is considered
-invalid and will result in nulls being returned for all columns. It is considered invalid if it
-violates the JSON specification, but with a few extensions.
+In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON can be. After
+3.5.0 this was updated to be 1,000 by default. The current GPU implementation of JSON Scan and
+`from_json` limits this to 254 no matter what version of Spark is used. If the nesting level is
+over this the JSON is considered invalid and all values will be returned as nulls.
+`get_json_object` and `json_tuple` have a maximum nesting depth of 64. An exception is thrown if
+the nesting depth goes over the maximum.
 
-* Single quotes are allowed to quote strings and keys
-* Unquoted values like NaN and Infinity can be parsed as floating point values
-* Control characters do not need to be replaced with the corresponding escape sequences in a
-  quoted string.
-* Garbage at the end of a row, if there is valid JSON at the beginning of the row, is ignored.
+Spark 3.5.0 and above limits the maximum string length to 20,000,000 and the maximum number length to
+1,000. We do not have any of these limits on the GPU.
 
-The GPU implementation does the same kinds of validations, but many of them are done on a per-column
-basis, which, for example, means if a number is formatted incorrectly, it is likely only that value
-will be considered invalid and return a null instead of nulls for the entire row.
+We, like Spark, cannot support a JSON string that is larger than 2 GiB in size.
 
-There are options that can be used to enable and disable many of these features which are mostly
-listed below.
+### JSON Validation
 
-### JSON options
+Spark supports the option `allowNonNumericNumbers`. Versions of Spark prior to 3.3.0 were inconsistent between
+quoted and non-quoted values ([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The
+GPU implementation is consistent with 3.3.0 and above.
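
To illustrate the option in question, a small sketch (spark-shell assumed; the file path and schema are hypothetical, not from this commit):

```scala
// Sketch only: NaN/Infinity handling via allowNonNumericNumbers.
val df = spark.read
  .schema("value DOUBLE")
  .option("allowNonNumericNumbers", "true") // accept NaN, Infinity, +INF, ...
  .json("/tmp/values.jsonl")

// With the option set to false, a row like {"value": NaN} parses to null
// on Spark 3.3.0+ and, consistently, on the GPU.
df.show()
```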
 
-Spark supports passing options to the JSON parser when reading a dataset. In most cases if the RAPIDS Accelerator
-sees one of these options that it does not support it will fall back to the CPU. In some cases we do not. The
-following options are documented below.
+### JSON Floating Point Types
 
-- `allowNumericLeadingZeros` - Allows leading zeros in numbers (e.g. 00012). By default this is set to false.
-  When it is false Spark considers the JSON invalid if it encounters this type of number. The RAPIDS
-  Accelerator supports validating columns that are returned to the user with this option on or off.
-
-- `allowUnquotedControlChars` - Allows JSON Strings to contain unquoted control characters (ASCII characters with
-  value less than 32, including tab and line feed characters) or not. By default this is set to false. If the schema
-  is provided while reading JSON file, then this flag has no impact on the RAPIDS Accelerator as it always allows
-  unquoted control characters but Spark sees these are invalid are returns nulls. However, if the schema is not provided
-  and this option is false, then RAPIDS Accelerator's behavior is same as Spark where an exception is thrown
-  as discussed in `JSON Schema discovery` section.
-
-- `allowNonNumericNumbers` - Allows `NaN` and `Infinity` values to be parsed (note that these are not valid numeric
-  values in the [JSON specification](https://json.org)). Spark versions prior to 3.3.0 have inconsistent behavior and will
-  parse some variants of `NaN` and `Infinity` even when this option is disabled
-  ([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
-  Spark version 3.3.0 and later.
-
-### Nesting
-In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON can be. After
-3.5.0 this was updated to be 1000 by default. The current GPU implementation limits this to 254
-no matter what version of Spark is used. If the nesting level is over this the JSON is considered
-invalid and all values will be returned as nulls.
-
-Mixed types can have some problems. If an item being read could have some lines that are arrays
-and others that are structs/dictionaries it is possible an error will be thrown.
-
-Dates and Timestamps have some issues and may return values for technically invalid inputs.
-
-Floating point numbers have issues generally like with the rest of Spark, and we can parse them into
-a valid floating point number, but it might not match 100% with the way Spark does it.
-
-Strings are supported, but the data returned might not be normalized in the same way as the CPU
-implementation. Generally this comes down to the GPU not modifying the input, whereas Spark will
-do things like remove extra white space and parse numbers before turning them back into a string.
+Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).
 
-### JSON Floating Point
+### JSON Integral Types
 
-Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).
+Versions of Spark prior to 3.3.0 would parse quoted integer values, like "1". But 3.3.0 and above consider
+these to be invalid and will return `null` when parsed as an integral type. The GPU implementation
+follows 3.3.0 and above.
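
A sketch of the behavior described above (spark-shell assumed; the data is illustrative):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("i", IntegerType)))

// Quoted integers are invalid for integral types on Spark 3.3.0+ and on the GPU.
val df = Seq("""{"i": 1}""", """{"i": "1"}""").toDF("jsonStr")
  .select(from_json($"jsonStr", schema).alias("parsed"))

df.show() // expected: {1} for the first row, {null} for the second
```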
 
-Prior to Spark 3.3.0, reading JSON strings such as `"+Infinity"` when specifying that the data type is `FloatType`
-or `DoubleType` caused these values to be parsed even when `allowNonNumericNumbers` is set to false. Also, Spark
-versions prior to 3.3.0 only supported the `"Infinity"` and `"-Infinity"` representations of infinity and did not
-support `"+INF"`, `"-INF"`, or `"+Infinity"`, which Spark considers valid when unquoted. The GPU JSON reader is
-consistent with the behavior in Spark 3.3.0 and later.
+### JSON Decimal Types
 
-Another limitation of the GPU JSON reader is that it will parse strings containing non-string boolean or numeric values where
-Spark will treat them as invalid inputs and will just return `null`.
+Spark supports parsing decimal types formatted as either floating point or integral numbers, even when the
+value is in a quoted string. If it is in a quoted string, the locale of the JVM is used to determine the number
+format. If the locale is not `US`, which is the default, we will fall back to the CPU because we do not
+currently parse those numbers correctly. The `US` format removes all commas ',' from the quoted string.
+As a part of this, though, non-Arabic numerals are also supported. We do not support parsing these numerals;
+see [issue 10532](https://github.com/NVIDIA/spark-rapids/issues/10532).
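
A sketch of decimal parsing under a `US`-locale JVM (spark-shell assumed; the values are illustrative, with the quoted, comma-grouped case being the one described above):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val schema = StructType(Seq(StructField("price", DecimalType(10, 2))))

// Plain numbers and quoted strings are both accepted; in the US locale the
// commas in "1,234.56" act as grouping separators and are removed.
val df = Seq("""{"price": 1234.56}""", """{"price": "1,234.56"}""").toDF("jsonStr")
  .select(from_json($"jsonStr", schema).alias("parsed"))

df.show(truncate = false)
```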
 
-### JSON Dates/Timestamps
+### JSON Date/Timestamp Types
 
 Dates and timestamps are not supported by default in JSON parser, since the GPU implementation is not 100%
 compatible with Apache Spark.
 If needed, they can be turned on through the config `spark.rapids.sql.json.read.datetime.enabled`.
-Once enabled, the JSON parser still does not support the `TimestampNTZ` type and will fall back to CPU
-if `spark.sql.timestampType` is set to `TIMESTAMP_NTZ` or if an explicit schema is provided that
-contains the `TimestampNTZ` type.
+This config works for both JSON scan and `from_json`. Once enabled, the JSON parser still does
+not support the `TimestampNTZ` type and will fall back to CPU if `spark.sql.timestampType` is set
+to `TIMESTAMP_NTZ` or if an explicit schema is provided that contains the `TimestampNTZ` type.
 
 There is currently no support for reading numeric values as timestamps and null values are returned instead
-([#4940](https://github.com/NVIDIA/spark-rapids/issues/4940)). A workaround would be to read as longs and then cast
-to timestamp.
+([#4940](https://github.com/NVIDIA/spark-rapids/issues/4940)). A workaround would be to read as longs and then cast to timestamp.
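
A sketch of that workaround (spark-shell assumed; the path and column name are hypothetical, with epoch seconds in the data):

```scala
import org.apache.spark.sql.functions.col

// Read epoch-second values as LONG (supported), then cast to TIMESTAMP.
val df = spark.read
  .schema("eventTime LONG")
  .json("/tmp/events.jsonl")
  .withColumn("eventTime", col("eventTime").cast("timestamp"))

df.printSchema()
```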
 
-### JSON Schema discovery
+### JSON Arrays and Structs with Overflowing Numbers
 
-Spark SQL can automatically infer the schema of a JSON dataset if schema is not provided explicitly. The CPU
-handles schema discovery and there is no GPU acceleration of this. By default Spark will read/parse the entire
-dataset to determine the schema. This means that some options/errors which are ignored by the GPU may still
-result in an exception if used with schema discovery.
+Spark is inconsistent between versions in how it handles overflowing numbers that are nested in either an array
+or a non-top-level struct. In some versions only the value that overflowed is marked as null. In other versions the
+wrapping array or struct is marked as null. We currently only mark the individual value as null. This matches
+versions 3.4.2 and above of Spark for structs. On most versions of Spark, a single overflowing value inside an
+array invalidates the entire array.
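
A sketch of the nested-overflow case (spark-shell assumed; the data is illustrative):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("arr", ArrayType(IntegerType))))

// 9999999999 overflows a 32-bit INT.
val df = Seq("""{"arr": [1, 9999999999]}""").toDF("jsonStr")
  .select(from_json($"jsonStr", schema).alias("parsed"))

// GPU: only the overflowing element is null ([1, null]); most CPU Spark
// versions instead invalidate the entire array.
df.show(truncate = false)
```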
 
-### `from_json` function
+### Duplicate Struct Names
 
-`JsonToStructs` of `from_json` is based on the same code as reading a JSON lines file. There are
-a few differences with it.
+The JSON specification technically allows for duplicate keys in a struct, but does not explain what to
+do with them. In the case of Spark it is inconsistent between operators which value wins. For
+`get_json_object` it depends on the query being performed. We do not always match what Spark does. We do
+match it in many cases, but we consider this enough of a corner case that we have not tried to make it
+work in all cases.
 
-The `from_json` function is disabled by default because it is experimental and has some known
-incompatibilities with Spark, and can be enabled by setting
-`spark.rapids.sql.expression.JsonToStructs=true`. You don't need to set
-`spark.rapids.sql.format.json.enabled` and`spark.rapids.sql.format.json.read.enabled` to true.
-In addition, if the input schema contains date and/or timestamp types, an additional config
-`spark.rapids.sql.json.read.datetime.enabled` also needs to be set to `true` in order
-to enable this function on the GPU.
+We also do not support schemas where there are duplicate column names. We just fall back to the CPU for those cases.
 
-There is no schema discovery as a schema is required as input to `from_json`
+### JSON Normalization (String Types)
 
-In addition to `structs`, a top level `map` type is supported, but only if the key and value are
-strings.
+In versions of Spark prior to 4.0.0, input JSON strings were parsed to JSON tokens and then converted back to
+strings. This effectively normalizes the output string. So things like single quotes are transformed into double
+quotes, floating point numbers are parsed and converted back to strings, possibly changing the format, and
+escaped characters are converted back to their simplest form. We try to support this on the GPU as well. Single quotes
+will be converted to double quotes. Only `get_json_object` and `json_tuple` attempt to normalize floating point
+numbers. There is no implementation on the GPU right now that tries to normalize escape characters.
+
+### `from_json` Function
+
+`JsonToStructs`, or `from_json`, is based on the same code as reading a JSON lines file. There are
+a few differences with it.
 
-### `to_json` function
+The main difference is that `from_json` supports parsing Maps and Arrays directly from a JSON column, whereas
+JSON Scan only supports parsing top level structs. The GPU implementation of `from_json` has support for parsing
+a `MAP<STRING,STRING>` as a top level schema, but does not currently support arrays at the top level.
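
A sketch of the top-level map case (spark-shell assumed; the data is illustrative):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{MapType, StringType}

// from_json with a top-level MAP<STRING,STRING> schema, which JSON Scan
// cannot express; top-level arrays are still unsupported on the GPU.
val df = Seq("""{"a": "1", "b": "2"}""").toDF("jsonStr")
  .select(from_json($"jsonStr", MapType(StringType, StringType)).alias("m"))

df.show(truncate = false)
```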
 
-The `to_json` function is disabled by default because it is experimental and has some known incompatibilities
-with Spark, and can be enabled by setting `spark.rapids.sql.expression.StructsToJson=true`.
+### `to_json` Function
 
 Known issues are:
 
 - There can be rounding differences when formatting floating-point numbers as strings. For example, Spark may
   produce `-4.1243574E26` but the GPU may produce `-4.124357351E26`.
 - Not all JSON options are respected
 
-### get_json_object
+### `get_json_object` Function
 
 Known issue:
 - [Floating-point number normalization error](https://github.com/NVIDIA/spark-rapids-jni/issues/1922). `get_json_object` floating-point number normalization on the GPU could sometimes return incorrect results if the string contains high-precision values; see the String to Float and Float to String section for more details.

docs/supported_ops.md (+2 −2)
@@ -9279,7 +9279,7 @@ are limited.
 <td rowSpan="2">JsonToStructs</td>
 <td rowSpan="2">`from_json`</td>
 <td rowSpan="2">Returns a struct value with the given `jsonStr` and `schema`</td>
-<td rowSpan="2">This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case</td>
+<td rowSpan="2">None</td>
 <td rowSpan="2">project</td>
 <td>jsonStr</td>
 <td> </td>
@@ -9320,7 +9320,7 @@ are limited.
 <td> </td>
 <td> </td>
 <td><b>NS</b></td>
-<td><em>PS<br/>MAP only supports keys and values that are of STRING type;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
+<td><em>PS<br/>MAP only supports keys and values that are of STRING type and is only supported at the top level;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
 <td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
 <td> </td>
 <td> </td>

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala (+3 −5)
@@ -3780,7 +3780,8 @@ object GpuOverrides extends Logging {
       ExprChecks.projectOnly(
         TypeSig.STRUCT.nested(jsonStructReadTypes) +
           TypeSig.MAP.nested(TypeSig.STRING).withPsNote(TypeEnum.MAP,
-            "MAP only supports keys and values that are of STRING type"),
+            "MAP only supports keys and values that are of STRING type " +
+            "and is only supported at the top level"),
         (TypeSig.STRUCT + TypeSig.MAP + TypeSig.ARRAY).nested(TypeSig.all),
         Seq(ParamCheck("jsonStr", TypeSig.STRING, TypeSig.STRING))),
       (a, conf, p, r) => new UnaryExprMeta[JsonToStructs](a, conf, p, r) {
@@ -3821,10 +3822,7 @@ object GpuOverrides extends Logging {
       override def convertToGpu(child: Expression): GpuExpression =
         // GPU implementation currently does not support duplicated json key names in input
         GpuJsonToStructs(a.schema, a.options, child, a.timeZoneId)
-      }).disabledByDefault("it is currently in beta and undergoes continuous enhancements."+
-        " Please consult the "+
-        "[compatibility documentation](../compatibility.md#json-supporting-types)"+
-        " to determine whether you can enable this configuration for your use case"),
+      }),
     expr[StructsToJson](
       "Converts structs to JSON text format",
       ExprChecks.projectOnly(
