[BUG] Requested types ignored if prune_schema is enabled for JSON reading #16797

Open
revans2 opened this issue Sep 11, 2024 · 0 comments
Labels: bug (Something isn't working), Spark (Functionality that helps Spark RAPIDS)

revans2 commented Sep 11, 2024

Describe the bug
I noticed this in some unit tests for the Java APIs when I tried to enable schema pruning in cuDF by default for the Java JSON read APIs that explicitly do column pruning. The following tests fail with pruning enabled:

  • @Test
    void testReadJSONNestedTypes() {
      Schema.Builder root = Schema.builder();
      Schema.Builder a = root.addColumn(DType.STRUCT, "a");
      a.addColumn(DType.STRING, "b");
      a.addColumn(DType.STRING, "c");
      a.addColumn(DType.STRING, "missing");
      Schema.Builder d = root.addColumn(DType.LIST, "d");
      d.addColumn(DType.INT64, "ignored");
      root.addColumn(DType.INT64, "also_missing");
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "f");
      eChild.addColumn(DType.STRING, "missing_in_list");
      eChild.addColumn(DType.INT64, "g");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType aStruct = new StructType(true,
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING));
      ListType dList = new ListType(true, new BasicType(true, DType.INT64));
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
              .column(aStruct,
                  new StructData(null, "C1", null),
                  new StructData("B2", "C2", null),
                  null,
                  null)
              .column(dList,
                  null,
                  null,
                  Arrays.asList(1L,2L,3L),
                  new ArrayList<Long>())
              .column((Long)null, null, null, null) // also_missing
              .column(eList,
                  null,
                  null,
                  null,
                  Arrays.asList(new StructData(null, null, 1L), new StructData(2L, null, null), new StructData(3L, null, 4L)))
              .build();
           Table table = Table.readJSON(schema, opts, NESTED_JSON_DATA_BUFFER)) {
        assertTablesAreEqual(expected, table);
      }
    }
    This test fails because column d is returned as a LIST<INT8> instead of the requested LIST<INT64>. With pruning disabled, column d is returned as LIST<INT64>, as expected.
  • @Test
    void testReadJSONNestedTypesDataSource() {
      Schema.Builder root = Schema.builder();
      Schema.Builder a = root.addColumn(DType.STRUCT, "a");
      a.addColumn(DType.STRING, "b");
      a.addColumn(DType.STRING, "c");
      a.addColumn(DType.STRING, "missing");
      Schema.Builder d = root.addColumn(DType.LIST, "d");
      d.addColumn(DType.INT64, "ignored");
      root.addColumn(DType.INT64, "also_missing");
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "g");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType aStruct = new StructType(true,
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING),
          new BasicType(true, DType.STRING));
      ListType dList = new ListType(true, new BasicType(true, DType.INT64));
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
              .column(aStruct,
                  new StructData(null, "C1", null),
                  new StructData("B2", "C2", null),
                  null,
                  null)
              .column(dList,
                  null,
                  null,
                  Arrays.asList(1L,2L,3L),
                  new ArrayList<Long>())
              .column((Long)null, null, null, null) // also_missing
              .column(eList,
                  null,
                  null,
                  null,
                  Arrays.asList(new StructData(1L), new StructData((Long)null), new StructData(4L)))
              .build();
           MultiBufferDataSource source = sourceFrom(NESTED_JSON_DATA_BUFFER);
           Table table = Table.readJSON(schema, opts, source)) {
        assertTablesAreEqual(expected, table);
      }
    }
    This test fails for the same reason as the one above: column d is returned with the wrong type.
  • @Test
    void testReadJSONNestedTypesVerySmallChanges() {
      Schema.Builder root = Schema.builder();
      Schema.Builder e = root.addColumn(DType.LIST, "e");
      Schema.Builder eChild = e.addColumn(DType.STRUCT, "ignored");
      eChild.addColumn(DType.INT64, "g");
      eChild.addColumn(DType.INT64, "f");
      Schema schema = root.build();
      JSONOptions opts = JSONOptions.builder()
          .withLines(true)
          .build();
      StructType eChildStruct = new StructType(true,
          new BasicType(true, DType.INT64),
          new BasicType(true, DType.INT64));
      ListType eList = new ListType(true, eChildStruct);
      try (Table expected = new Table.TestBuilder()
              .column(eList,
                  null,
                  null,
                  null,
                  Arrays.asList(new StructData(1L, null), new StructData(null, 2L), new StructData(4L, 3L)))
              .build();
           Table table = Table.readJSON(schema, opts, NESTED_JSON_DATA_BUFFER)) {
        assertTablesAreEqual(expected, table);
      }
    }
    This test fails because column e was requested as a LIST<STRUCT> but was returned as a LIST<INT8> column.

Steps/Code to reproduce bug
To reproduce this, take #16796 and enable column pruning for the tests listed above as failing. The third test is the scariest one: it appears to return totally invalid results, where the data column is empty despite there being offsets pointing into it.

If a C++ repro case is needed, I am happy to create one.

Expected behavior
I would expect the types in the schema to be honored, at least in the same way that they are for the non-pruning use case.
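
The expectation above can be phrased as a recursive check: at every level of nesting, the returned type should equal the requested type. The following is only a minimal sketch of that check. `TypeNode` and `SchemaTypeCheck` are hypothetical stand-ins invented for illustration and are not part of the cuDF Java API; the type names are plain strings, and the example values mirror the column d mismatch reported in the first test.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a nested column type; NOT a cuDF class. */
final class TypeNode {
  final String name;              // e.g. "LIST", "STRUCT", "INT64"
  final List<TypeNode> children;  // empty for non-nested types

  TypeNode(String name, TypeNode... children) {
    this.name = name;
    this.children = Arrays.asList(children);
  }
}

/** Hypothetical helper that compares a requested type tree with a returned one. */
final class SchemaTypeCheck {
  /**
   * Walks both type trees in parallel and describes the first difference found.
   * Returns null when the two trees match at every level.
   */
  static String firstMismatch(String path, TypeNode requested, TypeNode returned) {
    if (!requested.name.equals(returned.name)) {
      return path + ": requested " + requested.name + " but got " + returned.name;
    }
    if (requested.children.size() != returned.children.size()) {
      return path + ": requested " + requested.children.size()
          + " children but got " + returned.children.size();
    }
    for (int i = 0; i < requested.children.size(); i++) {
      String mismatch = firstMismatch(path + "[" + i + "]",
          requested.children.get(i), returned.children.get(i));
      if (mismatch != null) {
        return mismatch;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    // Column d from the first test: LIST<INT64> requested, LIST<INT8> returned.
    TypeNode requestedD = new TypeNode("LIST", new TypeNode("INT64"));
    TypeNode returnedD = new TypeNode("LIST", new TypeNode("INT8"));
    System.out.println(firstMismatch("d", requestedD, returnedD));
    // prints "d[0]: requested INT64 but got INT8"
  }
}
```

Applied to column e from the third test (LIST<STRUCT> requested, LIST<INT8> returned), the same walk would stop at the list child and report STRUCT versus INT8.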
