docs/additional-functionality/advanced_configs.md (+3 -3)
@@ -95,8 +95,8 @@ Name | Description | Default Value | Applicable at
 <a name="sql.format.hive.text.write.enabled"></a>spark.rapids.sql.format.hive.text.write.enabled|When set to false disables Hive text table write acceleration|false|Runtime
 <a name="sql.format.iceberg.enabled"></a>spark.rapids.sql.format.iceberg.enabled|When set to false disables all Iceberg acceleration|true|Runtime
 <a name="sql.format.iceberg.read.enabled"></a>spark.rapids.sql.format.iceberg.read.enabled|When set to false disables Iceberg input acceleration|true|Runtime
-<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|false|Runtime
-<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|false|Runtime
+<a name="sql.format.json.enabled"></a>spark.rapids.sql.format.json.enabled|When set to true enables all json input and output acceleration. (only input is currently supported anyways)|true|Runtime
+<a name="sql.format.json.read.enabled"></a>spark.rapids.sql.format.json.read.enabled|When set to true enables json input acceleration|true|Runtime
 <a name="sql.format.orc.enabled"></a>spark.rapids.sql.format.orc.enabled|When set to false disables all orc input and output acceleration|true|Runtime
 <a name="sql.format.orc.floatTypesToString.enable"></a>spark.rapids.sql.format.orc.floatTypesToString.enable|When reading an ORC file, the source data schemas(schemas of ORC file) may differ from the target schemas (schemas of the reader), we need to handle the castings from source type to target type. Since float/double numbers in GPU have different precision with CPU, when casting float/double to string, the result of GPU is different from result of CPU spark. Its default value is `true` (this means the strings result will differ from result of CPU). If it's set `false` explicitly and there exists casting from float/double to string in the job, then such behavior will cause an exception, and the job will fail.|true|Runtime
 <a name="sql.format.orc.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.orc.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the file is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with MULTITHREADED reader, see spark.rapids.sql.format.orc.reader.type.|2147483647|Runtime
@@ -278,7 +278,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
 <a name="sql.expression.IsNaN"></a>spark.rapids.sql.expression.IsNaN|`isnan`|Checks if a value is NaN|true|None|
 <a name="sql.expression.IsNotNull"></a>spark.rapids.sql.expression.IsNotNull|`isnotnull`|Checks if a value is not null|true|None|
 <a name="sql.expression.IsNull"></a>spark.rapids.sql.expression.IsNull|`isnull`|Checks if a value is null|true|None|
-<a name="sql.expression.JsonToStructs"></a>spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|false|This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case|
+<a name="sql.expression.JsonToStructs"></a>spark.rapids.sql.expression.JsonToStructs|`from_json`|Returns a struct value with the given `jsonStr` and `schema`|true|None|
 <a name="sql.expression.JsonTuple"></a>spark.rapids.sql.expression.JsonTuple|`json_tuple`|Returns a tuple like the function get_json_object, but it takes multiple names. All the input parameters and output column types are string.|false|This is disabled by default because Experimental feature that could be unstable or have performance issues.|
 <a name="sql.expression.KnownFloatingPointNormalized"></a>spark.rapids.sql.expression.KnownFloatingPointNormalized| |Tag to prevent redundant normalization|true|None|
 <a name="sql.expression.KnownNotNull"></a>spark.rapids.sql.expression.KnownNotNull| |Tag an expression as known to not be null|true|None|
docs/compatibility.md (+69 -92)
@@ -316,133 +316,110 @@ case.
 
 ## JSON
 
-The JSON format read is an experimental feature which is expected to have some issues, so we disable
-it by default. If you would like to test it, you need to enable `spark.rapids.sql.format.json.enabled` and
-`spark.rapids.sql.format.json.read.enabled`.
+JSON, despite being a standard format, has some ambiguity in it. Spark also offers the ability to allow
+some invalid JSON to be parsed. We have tried to provide JSON parsing that is compatible with
+what Apache Spark supports. Note that Spark itself has changed through different releases, and we will
+try to call out the releases for which we produce different results. JSON parsing is enabled by default
+except for date and timestamp types, where we still have work to complete. If you wish to disable
+JSON Scan you can set `spark.rapids.sql.format.json.enabled` or
+`spark.rapids.sql.format.json.read.enabled` to false. To disable `from_json` you can set
+`spark.rapids.sql.expression.JsonToStructs` to false.
 
-### Invalid JSON
+### Limits
 
-In Apache Spark on the CPU if a line in the JSON file is invalid the entire row is considered
-invalid and will result in nulls being returned for all columns. It is considered invalid if it
-violates the JSON specification, but with a few extensions.
+In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON can be. After
+3.5.0 this was updated to be 1,000 by default. The current GPU implementation of JSON Scan and
+`from_json` limits this to 254 no matter what version of Spark is used. If the nesting level is
+over this the JSON is considered invalid and all values will be returned as nulls.
+`get_json_object` and `json_tuple` have a maximum nesting depth of 64. An exception is thrown if
+the nesting depth goes over the maximum.
 
-* Single quotes are allowed to quote strings and keys
-* Unquoted values like NaN and Infinity can be parsed as floating point values
-* Control characters do not need to be replaced with the corresponding escape sequences in a
-  quoted string.
-* Garbage at the end of a row, if there is valid JSON at the beginning of the row, is ignored.
+Spark 3.5.0 and above limit the maximum string length to 20,000,000 and the maximum number length to
+1,000. We do not have any of these limits on the GPU.
 
-The GPU implementation does the same kinds of validations, but many of them are done on a per-column
-basis, which, for example, means if a number is formatted incorrectly, it is likely only that value
-will be considered invalid and return a null instead of nulls for the entire row.
+We, like Spark, cannot support a JSON string that is larger than 2 GiB in size.
 
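A minimal sketch of the nesting limit, assuming an active `spark` session with the plugin enabled; the expected null output follows from the limit described above:

```python
# Build a JSON value nested deeper than the documented GPU limit of 254.
from pyspark.sql.functions import from_json, col

depth = 300  # deeper than the GPU limit of 254
deep_json = '{"a":' * depth + "1" + "}" * depth
df = spark.createDataFrame([(deep_json,)], ["jsonStr"])
# Parsing a document nested past the limit is treated as invalid JSON,
# so the parsed column is expected to come back as null on the GPU.
df.select(from_json(col("jsonStr"), "a STRING").alias("parsed")).show(truncate=False)
```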
-There are options that can be used to enable and disable many of these features which are mostly
-listed below.
+### JSON Validation
 
-### JSON options
+Spark supports the option `allowNonNumericNumbers`. Versions of Spark prior to 3.3.0 were inconsistent between
+quoted and non-quoted values ([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The
+GPU implementation is consistent with 3.3.0 and above.
 
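A minimal sketch of reading with `allowNonNumericNumbers`, assuming an active `spark` session; the input path is hypothetical:

```python
# NaN and Infinity are not valid JSON numbers, but Spark can accept them
# when allowNonNumericNumbers is enabled.
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([StructField("x", DoubleType())])
df = (spark.read
      .schema(schema)
      .option("allowNonNumericNumbers", "true")  # accept NaN / Infinity tokens
      .json("/tmp/nonnumeric.jsonl"))            # hypothetical path, lines like {"x": NaN}
df.show()
```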
-Spark supports passing options to the JSON parser when reading a dataset. In most cases if the RAPIDS Accelerator
-sees one of these options that it does not support it will fall back to the CPU. In some cases we do not. The
-following options are documented below.
+### JSON Floating Point Types
 
-`allowNumericLeadingZeros` - Allows leading zeros in numbers (e.g. 00012). By default this is set to false.
-When it is false Spark considers the JSON invalid if it encounters this type of number. The RAPIDS
-Accelerator supports validating columns that are returned to the user with this option on or off.
-
-`allowUnquotedControlChars` - Allows JSON Strings to contain unquoted control characters (ASCII characters with
-value less than 32, including tab and line feed characters) or not. By default this is set to false. If the schema
-is provided while reading JSON file, then this flag has no impact on the RAPIDS Accelerator as it always allows
-unquoted control characters but Spark sees these are invalid are returns nulls. However, if the schema is not provided
-and this option is false, then RAPIDS Accelerator's behavior is same as Spark where an exception is thrown
-as discussed in `JSON Schema discovery` section.
-
-`allowNonNumericNumbers` - Allows `NaN` and `Infinity` values to be parsed (note that these are not valid numeric
-values in the [JSON specification](https://json.org)). Spark versions prior to 3.3.0 have inconsistent behavior and will
-parse some variants of `NaN` and `Infinity` even when this option is disabled
-([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
-Spark version 3.3.0 and later.
-
-### Nesting
-In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON can be. After
-3.5.0 this was updated to be 1000 by default. The current GPU implementation limits this to 254
-no matter what version of Spark is used. If the nesting level is over this the JSON is considered
-invalid and all values will be returned as nulls.
-
-Mixed types can have some problems. If an item being read could have some lines that are arrays
-and others that are structs/dictionaries it is possible an error will be thrown.
-
-Dates and Timestamps have some issues and may return values for technically invalid inputs.
-
-Floating point numbers have issues generally like with the rest of Spark, and we can parse them into
-a valid floating point number, but it might not match 100% with the way Spark does it.
-
-Strings are supported, but the data returned might not be normalized in the same way as the CPU
-implementation. Generally this comes down to the GPU not modifying the input, whereas Spark will
-do things like remove extra white space and parse numbers before turning them back into a string.
+Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).
 
-### JSON Floating Point
+### JSON Integral Types
 
-Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).
+Versions of Spark prior to 3.3.0 would parse quoted integer values, like "1". But 3.3.0 and above consider
+these to be invalid and will return `null` when parsed as integral types. The GPU implementation
+follows 3.3.0 and above.
 
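A minimal sketch of the quoted-integer behavior, assuming an active `spark` session:

```python
# On Spark 3.3.0+ (and on the GPU) a quoted "1" is invalid for an INT field.
from pyspark.sql.functions import from_json, col

df = spark.createDataFrame([('{"i": 1}',), ('{"i": "1"}',)], ["jsonStr"])
# Expected: {"i": 1} parses to {1}; {"i": "1"} parses to {null} on 3.3.0 and above.
df.select(col("jsonStr"), from_json(col("jsonStr"), "i INT").alias("parsed")).show()
```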
-Prior to Spark 3.3.0, reading JSON strings such as `"+Infinity"` when specifying that the data type is `FloatType`
-or `DoubleType` caused these values to be parsed even when `allowNonNumericNumbers` is set to false. Also, Spark
-versions prior to 3.3.0 only supported the `"Infinity"` and `"-Infinity"` representations of infinity and did not
-support `"+INF"`, `"-INF"`, or `"+Infinity"`, which Spark considers valid when unquoted. The GPU JSON reader is
-consistent with the behavior in Spark 3.3.0 and later.
+### JSON Decimal Types
 
-Another limitation of the GPU JSON reader is that it will parse strings containing non-string boolean or numeric values where
-Spark will treat them as invalid inputs and will just return `null`.
+Spark supports parsing decimal types formatted as either floating point numbers or integral numbers, even if they
+are in a quoted string. If the value is in a quoted string, the locale of the JVM is used to determine the number
+format. If the locale is not `US`, which is the default, we will fall back to the CPU because we do not currently
+parse those numbers correctly. The `US` format removes all commas ',' from the quoted string.
+As a part of this, though, non-Arabic numerals are also supported. We do not support parsing these numbers;
+see [issue 10532](https://github.com/NVIDIA/spark-rapids/issues/10532).
 
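A minimal sketch, assuming an active `spark` session and the default `US` JVM locale:

```python
# With the US locale, ',' grouping separators in a quoted decimal are stripped
# before parsing; a non-US JVM locale makes the plan fall back to the CPU.
from pyspark.sql.functions import from_json, col

df = spark.createDataFrame([('{"d": "1,234.56"}',)], ["jsonStr"])
df.select(from_json(col("jsonStr"), "d DECIMAL(10,2)").alias("parsed")).show()
```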
-### JSON Dates/Timestamps
+### JSON Date/Timestamp Types
 
 Dates and timestamps are not supported by default in JSON parser, since the GPU implementation is not 100%
 compatible with Apache Spark.
 If needed, they can be turned on through the config `spark.rapids.sql.json.read.datetime.enabled`.
-Once enabled, the JSON parser still does not support the `TimestampNTZ` type and will fall back to CPU
-if `spark.sql.timestampType` is set to `TIMESTAMP_NTZ` or if an explicit schema is provided that
-contains the `TimestampNTZ` type.
+This config works for both JSON scan and `from_json`. Once enabled, the JSON parser still does
+not support the `TimestampNTZ` type and will fall back to CPU if `spark.sql.timestampType` is set
+to `TIMESTAMP_NTZ` or if an explicit schema is provided that contains the `TimestampNTZ` type.
 
 There is currently no support for reading numeric values as timestamps and null values are returned instead
-([#4940](https://github.com/NVIDIA/spark-rapids/issues/4940)). A workaround would be to read as longs and then cast
-to timestamp.
+([#4940](https://github.com/NVIDIA/spark-rapids/issues/4940)). A workaround would be to read as longs and then cast to timestamp.
 
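A minimal sketch of that workaround, assuming an active `spark` session and a hypothetical input path with lines like `{"event": 1700000000}`:

```python
# Read the numeric column as LONG, then cast afterwards; casting a long to
# timestamp interprets the value as seconds since the epoch.
from pyspark.sql.functions import col

df = spark.read.schema("event LONG").json("/tmp/events.jsonl")  # hypothetical path
df = df.withColumn("event_ts", col("event").cast("timestamp"))
```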
-### JSON Schema discovery
+### JSON Arrays and Structs with Overflowing Numbers
 
-Spark SQL can automatically infer the schema of a JSON dataset if schema is not provided explicitly. The CPU
-handles schema discovery and there is no GPU acceleration of this. By default Spark will read/parse the entire
-dataset to determine the schema. This means that some options/errors which are ignored by the GPU may still
-result in an exception if used with schema discovery.
+Spark is inconsistent between versions in how it handles overflowing numbers that are nested in either an array
+or a non-top-level struct. In some versions only the value that overflowed is marked as null. In other versions the
+wrapping array or struct is marked as null. We currently only mark the individual value as null. This matches
+versions 3.4.2 and above of Spark for structs. Arrays on most versions of Spark invalidate the entire array if there
+is a single value that overflows within it.
 
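A minimal sketch, assuming an active `spark` session (300 overflows a `TINYINT`):

```python
# GPU (and Spark 3.4.2+ for structs): only `b` becomes null, `s` itself survives;
# some older Spark versions null out the wrapping struct instead.
from pyspark.sql.functions import from_json, col

df = spark.createDataFrame([('{"s": {"b": 300}}',)], ["jsonStr"])
df.select(from_json(col("jsonStr"), "s STRUCT<b: TINYINT>").alias("parsed")).show(truncate=False)
```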
-### `from_json` function
+### Duplicate Struct Names
 
-`JsonToStructs` of `from_json` is based on the same code as reading a JSON lines file. There are
-a few differences with it.
+The JSON specification technically allows for duplicate keys in a struct, but does not explain what to
+do with them. In the case of Spark it is inconsistent between operators which value wins; for `get_json_object`
+it depends on the query being performed. We do not always match what Spark does. We do match it in many cases,
+but we consider this enough of a corner case that we have not tried to make it work in all cases.
 
-The `from_json` function is disabled by default because it is experimental and has some known
-incompatibilities with Spark, and can be enabled by setting
-`spark.rapids.sql.expression.JsonToStructs=true`. You don't need to set
-`spark.rapids.sql.format.json.enabled` and`spark.rapids.sql.format.json.read.enabled` to true.
-In addition, if the input schema contains date and/or timestamp types, an additional config
-`spark.rapids.sql.json.read.datetime.enabled` also needs to be set to `true` in order
-to enable this function on the GPU.
+We also do not support schemas where there are duplicate column names. We just fall back to the CPU for those cases.
 
-There is no schema discovery as a schema is required as input to `from_json`
+### JSON Normalization (String Types)
 
-In addition to `structs`, a top level `map` type is supported, but only if the key and value are
-strings.
+In versions of Spark prior to 4.0.0 input JSON strings were parsed to JSON tokens and then converted back to
+strings. This effectively normalizes the output string. So things like single quotes are transformed into double
+quotes, floating point numbers are parsed and converted back to strings possibly changing the format, and
+escaped characters are converted back to their simplest form. We try to support this on the GPU as well. Single quotes
+will be converted to double quotes. Only `get_json_object` and `json_tuple` attempt to normalize floating point
+numbers. There is no implementation on the GPU right now that tries to normalize escape characters.
+
+### `from_json` Function
+
+`JsonToStructs` or `from_json` is based on the same code as reading a JSON lines file. There are
+a few differences with it.
 
-### `to_json` function
+The main difference is that `from_json` supports parsing Maps and Arrays directly from a JSON column, whereas
+JSON Scan only supports parsing top level structs. The GPU implementation of `from_json` has support for parsing
+a `MAP<STRING,STRING>` as a top level schema, but does not currently support arrays at the top level.
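A minimal sketch of the top-level map support, assuming an active `spark` session:

```python
# A top-level MAP<STRING,STRING> schema is supported by the GPU `from_json`.
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import MapType, StringType

df = spark.createDataFrame([('{"k1": "v1", "k2": "v2"}',)], ["jsonStr"])
m = from_json(col("jsonStr"), MapType(StringType(), StringType()))
df.select(m.alias("m")).show(truncate=False)
```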
 
-The `to_json` function is disabled by default because it is experimental and has some known incompatibilities
-with Spark, and can be enabled by setting `spark.rapids.sql.expression.StructsToJson=true`.
+### `to_json` Function
 
 Known issues are:
 
 - There can be rounding differences when formatting floating-point numbers as strings. For example, Spark may
   produce `-4.1243574E26` but the GPU may produce `-4.124357351E26`.
 - Not all JSON options are respected
 
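A minimal sketch that exercises the rounding caveat above, assuming an active `spark` session:

```python
# The formatted float may differ in its trailing digits between CPU and GPU.
from pyspark.sql.functions import to_json, struct, col

df = spark.createDataFrame([(-4.1243574e26,)], ["x"])
df.select(to_json(struct(col("x"))).alias("j")).show(truncate=False)
```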
-### get_json_object
+### `get_json_object` Function
 
 Known issue:
 - [Floating-point number normalization error](https://github.com/NVIDIA/spark-rapids-jni/issues/1922). `get_json_object` floating-point number normalization on the GPU could sometimes return incorrect results if the string contains high-precision values, see the String to Float and Float to String section for more details.
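A minimal sketch of basic `get_json_object` usage, assuming an active `spark` session; high-precision float extraction is where the issue above may surface:

```python
# Extract a single JSON path as a string column.
from pyspark.sql.functions import get_json_object, col

df = spark.createDataFrame([('{"a": {"b": "hello"}}',)], ["jsonStr"])
df.select(get_json_object(col("jsonStr"), "$.a.b").alias("b")).show()
```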
docs/supported_ops.md (+2 -2)
@@ -9279,7 +9279,7 @@ are limited.
 <td rowSpan="2">JsonToStructs</td>
 <td rowSpan="2">`from_json`</td>
 <td rowSpan="2">Returns a struct value with the given `jsonStr` and `schema`</td>
-<td rowSpan="2">This is disabled by default because it is currently in beta and undergoes continuous enhancements. Please consult the [compatibility documentation](../compatibility.md#json-supporting-types) to determine whether you can enable this configuration for your use case</td>
+<td rowSpan="2">None</td>
 <td rowSpan="2">project</td>
 <td>jsonStr</td>
 <td> </td>
@@ -9320,7 +9320,7 @@ are limited.
 <td> </td>
 <td> </td>
 <td><b>NS</b></td>
-<td><em>PS<br/>MAP only supports keys and values that are of STRING type;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
+<td><em>PS<br/>MAP only supports keys and values that are of STRING type and is only supported at the top level;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>
 <td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types NULL, BINARY, CALENDAR, MAP, UDT, DAYTIME, YEARMONTH</em></td>