[BUG] cudf::io::read_json does not verify output column structures with the input schema #16799

Closed
ttnghia opened this issue Sep 11, 2024 · 1 comment · Fixed by #17029
Labels: bug (Something isn't working), cuIO (cuIO issue)

Comments

ttnghia (Contributor) commented Sep 11, 2024

Look at this example:

  std::string const json_string = R"({"data": [1, 2, 3], "data": [4, 5, 6]})";

  // Requested schema: "data" is STRUCT<"a": INT64, "b": INT64>.
  std::map<std::string, cudf::io::schema_element> dtype_schema;
  dtype_schema["data"] = cudf::io::schema_element{cudf::data_type{cudf::type_id::STRUCT}, {}};

  // Children of the requested struct.
  auto& child_schema = dtype_schema["data"].child_types;
  child_schema["a"]  = cudf::io::schema_element{cudf::data_type{cudf::type_id::INT64}, {}};
  child_schema["b"]  = cudf::io::schema_element{cudf::data_type{cudf::type_id::INT64}, {}};

  cudf::io::json_reader_options json_lines_options =
    cudf::io::json_reader_options::builder(
      cudf::io::source_info{json_string.c_str(), json_string.size()})
      .dtypes(dtype_schema)
      .lines(true);

  auto const table = cudf::io::read_json(json_lines_options);
  cudf::test::print(table.tbl->get_column(0).view());

Output:

cudf::list_view<int64_t>:
Length : 1
Offsets : 0, 3
   1, 2, 3, 4, 5, 6

Note that the input schema in the example above is "data": STRUCT<"a": INT64, "b": INT64>, whereas the data field in the input JSON is an array. The parsed data column (a lists column) is simply gathered into the output without being checked against the schema. The desired output here is a column of all nulls. More generally, the reader should make sure the output columns have the types and structures specified in the input schema.
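
To make the requested behavior concrete, here is a minimal, library-independent sketch of the check. It is not cudf code: `Kind`, `Schema`, `Column`, `make_all_null`, and `enforce_schema` are hypothetical names invented for illustration, and treating a missing struct child as all nulls is an assumption of this sketch rather than something stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for illustration only; these are not cudf types.
enum class Kind { INT64, LIST, STRUCT };

struct Schema {
  Kind kind;
  std::vector<std::string> child_names;  // STRUCT only
  std::vector<Schema> children;          // parallel to child_names
};

struct Column {
  Kind kind;
  std::size_t num_rows;
  std::vector<bool> valid;               // one validity flag per row
  std::vector<std::string> child_names;  // STRUCT only
  std::vector<Column> children;          // parallel to child_names
};

// Build a column with the requested structure in which every row is null.
Column make_all_null(Schema const& schema, std::size_t num_rows)
{
  Column out{schema.kind, num_rows, std::vector<bool>(num_rows, false), schema.child_names, {}};
  for (auto const& child : schema.children) {
    out.children.push_back(make_all_null(child, num_rows));
  }
  return out;
}

// Keep the parsed column only where its kind matches the schema; otherwise
// substitute an all-null column of the requested structure.
// Assumption of this sketch: LIST element types are not checked.
Column enforce_schema(Column const& parsed, Schema const& schema)
{
  assert(schema.child_names.size() == schema.children.size());
  if (parsed.kind != schema.kind) { return make_all_null(schema, parsed.num_rows); }
  if (schema.kind != Kind::STRUCT) { return parsed; }

  Column out{Kind::STRUCT, parsed.num_rows, parsed.valid, schema.child_names, {}};
  for (std::size_t i = 0; i < schema.children.size(); ++i) {
    // Look up the parsed child with the same name, if there is one.
    Column const* match = nullptr;
    for (std::size_t j = 0; j < parsed.child_names.size(); ++j) {
      if (parsed.child_names[j] == schema.child_names[i]) {
        match = &parsed.children[j];
        break;
      }
    }
    // Assumption of this sketch: a requested child that was not parsed becomes all nulls.
    out.children.push_back(match != nullptr
                             ? enforce_schema(*match, schema.children[i])
                             : make_all_null(schema.children[i], parsed.num_rows));
  }
  return out;
}
```

Applied to the example above, a parsed LIST column checked against a STRUCT<"a": INT64, "b": INT64> schema would come back as a STRUCT column with two INT64 children and every row null, instead of the lists column being passed through.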

ttnghia added the bug and cuIO labels on Sep 11, 2024
karthikeyann (Contributor) commented

The input type is only a hint unless prune_columns(true) is set.
Try prune_columns(true).
