[BUG] [Java] CudfException on conversion of data between Arrow and Cudf #16794

Open
mythrocks opened this issue Sep 11, 2024 · 2 comments

@mythrocks
Contributor

Description
The following exception is seen when the CUDF JNI bindings are used to convert CUDF data to Arrow format and then back to CUDF:

"Caused by: ai.rapids.cudf.CudfException: CUDF failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-856-cuda12/thirdparty/cudf/cpp/src/interop/arrow_utilities.cpp:74: Unsupported type_id conversion to cudf",

This was found during integration tests with https://github.com/NVIDIA/spark-rapids and https://github.com/NVIDIA/spark-rapids-jni.

I have narrowed it down to the merge of #16590. Prior versions of CUDF that don't include this commit seem to work fine.

Repro
We don't yet have a narrow repro that uses only CUDF/JNI. I will include the PySpark repro here and replace it with something smaller once we have it:

import sys
import awkward as ak
import numpy as np
from pathlib import Path
from pyspark.sql.types import *
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

def udf_array_square(iterator):
    # Arrow-batch UDF: square every element of the "doubleF" array column,
    # round-tripping each batch through awkward-array.
    for batch in iterator:
        b = ak.from_arrow(batch)
        b2 = ak.zip({"doubleF": np.square(b["doubleF"])}, depth_limit=1)
        yield from ak.to_arrow_table(b2).to_batches()

if __name__ == '__main__':
    spark = SparkSession.builder.appName(f'{Path(__file__).stem}').getOrCreate()
    spark.conf.set("spark.rapids.sql.enabled", "true")

    # One million rows of ARRAY<DOUBLE>, i.e. LIST<FLOAT64> on the GPU.
    df = spark.sql("select Array(rand(),rand(),rand()) doubleF from range(1e6)")
    df.show(truncate=False)

    newdf = df.mapInArrow(udf_array_square, df.schema)
    newdf.show(truncate=False)
    newdf.collect()
    newdf.explain()

Running this PySpark script raises the CudfException and fails the query.
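
To see where the LARGE_LIST enters, a debug variant of the UDF (a diagnostic sketch, not part of the original repro) can print the schema of each batch on both sides of the awkward round-trip:

    import sys

    def udf_array_square_debug(iterator):
        for batch in iterator:
            # Schema Spark hands to the UDF; expected: list<double>.
            print("input schema: ", batch.schema, file=sys.stderr)
            b = ak.from_arrow(batch)
            b2 = ak.zip({"doubleF": np.square(b["doubleF"])}, depth_limit=1)
            for out in ak.to_arrow_table(b2).to_batches():
                # Schema the UDF hands back; if large_list<double> shows up
                # here, that is what trips the Arrow->CUDF reader.
                print("output schema:", out.schema, file=sys.stderr)
                yield out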

Expected behaviour
One would expect type conversions between CUDF and Arrow not to fail: a table converted to Arrow and back should round-trip losslessly.
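
As a minimal statement of that expectation in the cudf Python API (a sketch; it assumes LIST columns take the same interop path as the JNI bindings):

    import cudf
    import pyarrow as pa

    # A LIST column converted to Arrow and back should keep its type.
    gdf = cudf.DataFrame({"doubleF": [[0.1, 0.2], [0.3]]})
    tbl = gdf.to_arrow()                                       # CUDF -> Arrow
    assert pa.types.is_list(tbl.schema.field("doubleF").type)  # list, not large_list
    back = cudf.DataFrame.from_arrow(tbl)                      # Arrow -> CUDF
    assert back["doubleF"].dtype == gdf["doubleF"].dtype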

@mythrocks added the bug label Sep 11, 2024
@mythrocks
Contributor Author

Here are my findings so far:

  1. The failure is the result of the Arrow type ARROW_LARGE_LIST not having a mapping for conversion to the corresponding CUDF type.
  2. It doesn't look like this has to do with a mismatch in NANOARROW types.
  3. It isn't yet clear where the "large list" type ends up being introduced:
    a. The input CUDF table consists of STRUCT< LIST< FLOAT >>.
    b. While tracing, during conversion to Arrow (Table::convertCudfToArrowTable()), one sees that the type ids are congruent.
    c. It is only on the conversion back to CUDF (Table::convertArrowTableToCudf()) that the LIST is deemed a LARGE_LIST; the reader seems to misinterpret that specific type (see the sketch just below).
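
At the Arrow C data interface level the two types are distinct: a list carries the format string "+l" (32-bit offsets) while a large list carries "+L" (64-bit offsets), so a converter that only maps "+l" to a CUDF LIST will reject the round-trip. A minimal pyarrow illustration:

    import pyarrow as pa

    small = pa.list_(pa.float64())       # C data interface format "+l", 32-bit offsets
    large = pa.large_list(pa.float64())  # C data interface format "+L", 64-bit offsets

    # Distinct types: a reader with a mapping for "+l" only fails on "+L",
    # matching the "Unsupported type_id conversion to cudf" error above.
    assert small != large
    print(small, large)  # list<item: double> large_list<item: double>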

I can narrow this repro down further with a Java test shortly.

@mythrocks self-assigned this Sep 18, 2024
@mythrocks
Contributor Author

I can narrow this repro down further with a Java test shortly.

I should revise this: it wouldn't be productive to convert this into a standalone Java test. The problem is specifically that the UDF in the repro returns an Arrow LARGE_LIST where it should return a LIST.
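
If the LARGE_LIST originates in awkward's Arrow conversion, one possible workaround on the repro side (an assumption on my part, not verified here) is to request 32-bit list offsets when converting back to Arrow; ak.to_arrow_table exposes a list_to32 flag for this:

    def udf_array_square_list32(iterator):
        for batch in iterator:
            b = ak.from_arrow(batch)
            b2 = ak.zip({"doubleF": np.square(b["doubleF"])}, depth_limit=1)
            # list_to32=True asks awkward to emit Arrow list (32-bit offsets)
            # instead of its default large_list (64-bit offsets).
            yield from ak.to_arrow_table(b2, list_to32=True).to_batches()

That would only sidestep the regression in the CUDF->Arrow path rather than fix it, of course.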

I have done a fair bit of digging into the problem and bisecting the changes. The crux of my findings concerns this part of the changes in #16590:

    - auto result = cudf::to_arrow(*tview, state->get_column_metadata(*tview));
    + auto got_arrow_schema = cudf::to_arrow_schema(*tview, state->get_column_metadata(*tview));
    + cudf::jni::set_nullable(got_arrow_schema.get());
    + auto got_arrow_array = cudf::to_arrow_host(*tview);
    + auto batch =
    +   arrow::ImportRecordBatch(&got_arrow_array->array, got_arrow_schema.get()).ValueOrDie();
    + auto result = arrow::Table::FromRecordBatches({batch}).ValueOrDie();

I can confirm that rolling back to using cudf::to_arrow() instead of cudf::to_arrow_host() / cudf::to_arrow_schema() allows the test to pass. Note that for the test to pass, I only rolled back the CUDF->Arrow conversion; I kept the new implementation of Arrow->CUDF.

It seems odd, but something in the way cudf::to_arrow_host() constructs the input Arrow table messes up the types in the output of the UDF.

On the face of it, the schemas of the tables constructed in both methods (i.e. cudf::to_arrow() and cudf::to_arrow_host()) seem to be identical. But I'm sure there's a subtle difference that precipitates the failure.
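
Since a printed top-level schema can hide a nested mismatch, one way to compare the two is to walk the full type tree of each field. The helper below is a diagnostic sketch in pyarrow (not part of the repro); it makes a list/large_list difference buried inside a STRUCT visible:

    import pyarrow as pa

    def dump_types(field, indent=0):
        # Recursively print every nested field, so a list vs. large_list
        # mismatch inside a STRUCT< LIST< FLOAT >> column shows up.
        print("  " * indent + f"{field.name}: {field.type}")
        t = field.type
        if pa.types.is_struct(t):
            for i in range(t.num_fields):
                dump_types(t.field(i), indent + 1)
        elif pa.types.is_list(t) or pa.types.is_large_list(t):
            dump_types(t.value_field, indent + 1)

    # Usage: for f in arrow_table.schema: dump_types(f)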

I'm still hopeful that we can remedy this with a change in either TableJni::convertCudfToArrowTable() or cudf::to_arrow_schema(). I'd better consult @vyasr to confirm that I'm on the right track.
