[BUG] [Java] CudfException on conversion of data between Arrow and Cudf #16794

Open
mythrocks opened this issue Sep 11, 2024 · 2 comments

@mythrocks
Contributor

Description
The following exception is seen when the CUDF JNI bindings are used to convert CUDF data to Arrow format and then back to CUDF:

"Caused by: ai.rapids.cudf.CudfException: CUDF failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-856-cuda12/thirdparty/cudf/cpp/src/interop/arrow_utilities.cpp:74: Unsupported type_id conversion to cudf",

This was found during integration tests with https://github.com/NVIDIA/spark-rapids and https://github.com/NVIDIA/spark-rapids-jni.

I have narrowed it down to the merge of #16590. Prior versions of CUDF that don't include this commit seem to work fine.

Repro
We don't yet have a narrow repro that uses only CUDF/JNI. I will include the PySpark repro here and replace it with something smaller once we have it:

import sys
import awkward as ak
import numpy as np
from pathlib import Path
from pyspark.sql.types import *
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

def udf_array_square(iterator):
    # Arrow-batch UDF: square every element of the "doubleF" array column,
    # round-tripping each batch through awkward-array.
    for batch in iterator:
        b = ak.from_arrow(batch)
        b2 = ak.zip({"doubleF": np.square(b["doubleF"])}, depth_limit=1)
        yield from ak.to_arrow_table(b2).to_batches()

if __name__ == '__main__':
    spark = SparkSession.builder.appName(f'{Path(__file__).stem}').getOrCreate()
    spark.conf.set("spark.rapids.sql.enabled", "true")

    # One million rows of ARRAY<DOUBLE>, i.e. LIST<FLOAT64> on the GPU.
    df = spark.sql("select Array(rand(),rand(),rand()) doubleF from range(1e6)")
    df.show(truncate=False)

    newdf = df.mapInArrow(udf_array_square, df.schema)
    newdf.show(truncate=False)
    newdf.collect()
    newdf.explain()

Running this PySpark script raises the CudfException and fails the query.
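
To see where the LARGE_LIST enters, a debug variant of the UDF (a diagnostic sketch, not part of the original repro) can print the schema of each batch on both sides of the awkward round-trip:

    import sys

    def udf_array_square_debug(iterator):
        for batch in iterator:
            # Schema Spark hands to the UDF; expected: list<double>.
            print("input schema: ", batch.schema, file=sys.stderr)
            b = ak.from_arrow(batch)
            b2 = ak.zip({"doubleF": np.square(b["doubleF"])}, depth_limit=1)
            for out in ak.to_arrow_table(b2).to_batches():
                # Schema the UDF hands back; if large_list<double> shows up
                # here, that is what trips the Arrow->CUDF reader.
                print("output schema:", out.schema, file=sys.stderr)
                yield out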

Expected behaviour
One would expect type conversions between CUDF and Arrow not to fail: a table converted to Arrow and back should round-trip losslessly.
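
As a minimal statement of that expectation in the cudf Python API (a sketch; it assumes LIST columns take the same interop path as the JNI bindings):

    import cudf
    import pyarrow as pa

    # A LIST column converted to Arrow and back should keep its type.
    gdf = cudf.DataFrame({"doubleF": [[0.1, 0.2], [0.3]]})
    tbl = gdf.to_arrow()                                       # CUDF -> Arrow
    assert pa.types.is_list(tbl.schema.field("doubleF").type)  # list, not large_list
    back = cudf.DataFrame.from_arrow(tbl)                      # Arrow -> CUDF
    assert back["doubleF"].dtype == gdf["doubleF"].dtype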

@mythrocks added the bug label Sep 11, 2024
@mythrocks
Contributor Author

Here are my findings so far:

  1. The failure is the result of the Arrow type ARROW_LARGE_LIST not having a mapping for conversion to the corresponding CUDF type.
  2. It doesn't look like this has to do with a mismatch in NANOARROW types.
  3. It isn't yet clear where the "large list" type ends up being introduced:
    a. The input CUDF table consists of STRUCT< LIST< FLOAT >>.
    b. While tracing, during conversion to Arrow (Table::convertCudfToArrowTable()), one sees that the type ids are congruent.
    c. It is only on the conversion back to CUDF (Table::convertArrowTableToCudf()) that the LIST is deemed a LARGE_LIST; the reader seems to misinterpret that specific type (see the sketch just below).
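
At the Arrow C data interface level the two types are distinct: a list carries the format string "+l" (32-bit offsets) while a large list carries "+L" (64-bit offsets), so a converter that only maps "+l" to a CUDF LIST will reject the round-trip. A minimal pyarrow illustration:

    import pyarrow as pa

    small = pa.list_(pa.float64())       # C data interface format "+l", 32-bit offsets
    large = pa.large_list(pa.float64())  # C data interface format "+L", 64-bit offsets

    # Distinct types: a reader with a mapping for "+l" only fails on "+L",
    # matching the "Unsupported type_id conversion to cudf" error above.
    assert small != large
    print(small, large)  # list<item: double> large_list<item: double>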

I can narrow this repro down further with a Java test shortly.

@mythrocks self-assigned this Sep 18, 2024
@mythrocks
Contributor Author

I can narrow this repro down further with a Java test shortly.

I should revise this: it wouldn't be productive to convert this into a standalone Java test. The problem is specifically that the UDF in the repro returns an Arrow LARGE_LIST where it should return a LIST.
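
If the LARGE_LIST originates in awkward's Arrow conversion, one possible workaround on the repro side (an assumption on my part, not verified here) is to request 32-bit list offsets when converting back to Arrow; ak.to_arrow_table exposes a list_to32 flag for this:

    def udf_array_square_list32(iterator):
        for batch in iterator:
            b = ak.from_arrow(batch)
            b2 = ak.zip({"doubleF": np.square(b["doubleF"])}, depth_limit=1)
            # list_to32=True asks awkward to emit Arrow list (32-bit offsets)
            # instead of its default large_list (64-bit offsets).
            yield from ak.to_arrow_table(b2, list_to32=True).to_batches()

That would only sidestep the regression in the CUDF->Arrow path rather than fix it, of course.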

I have done a fair bit of digging into the problem and bisecting the changes. The crux of my findings concerns this part of the changes in #16590:

    - auto result = cudf::to_arrow(*tview, state->get_column_metadata(*tview));
    + auto got_arrow_schema = cudf::to_arrow_schema(*tview, state->get_column_metadata(*tview));
    + cudf::jni::set_nullable(got_arrow_schema.get());
    + auto got_arrow_array = cudf::to_arrow_host(*tview);
    + auto batch =
    +   arrow::ImportRecordBatch(&got_arrow_array->array, got_arrow_schema.get()).ValueOrDie();
    + auto result = arrow::Table::FromRecordBatches({batch}).ValueOrDie();

I can confirm that rolling back to using cudf::to_arrow() instead of cudf::to_arrow_host() / cudf::to_arrow_schema() allows the test to pass. Note that for the test to pass, I only rolled back the CUDF->Arrow conversion; I kept the new implementation of Arrow->CUDF.

It seems odd, but something in the way cudf::to_arrow_host() constructs the input Arrow table messes up the types in the output of the UDF.

On the face of it, the schemas of the tables constructed in both methods (i.e. cudf::to_arrow() and cudf::to_arrow_host()) seem to be identical. But I'm sure there's a subtle difference that precipitates the failure.
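
Since a printed top-level schema can hide a nested mismatch, one way to compare the two is to walk the full type tree of each field. The helper below is a diagnostic sketch in pyarrow (not part of the repro); it makes a list/large_list difference buried inside a STRUCT visible:

    import pyarrow as pa

    def dump_types(field, indent=0):
        # Recursively print every nested field, so a list vs. large_list
        # mismatch inside a STRUCT< LIST< FLOAT >> column shows up.
        print("  " * indent + f"{field.name}: {field.type}")
        t = field.type
        if pa.types.is_struct(t):
            for i in range(t.num_fields):
                dump_types(t.field(i), indent + 1)
        elif pa.types.is_list(t) or pa.types.is_large_list(t):
            dump_types(t.value_field, indent + 1)

    # Usage: for f in arrow_table.schema: dump_types(f)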

I'm still hopeful that we can remedy this with a change in either TableJni::convertCudfToArrowTable() or cudf::to_arrow_schema(). I'd better consult @vyasr to confirm that I'm on the right track.
