
Conversion processors convert NDJSON filetypes to CSV #295

Open
dale-wahl opened this issue Oct 5, 2022 · 1 comment
Labels: enhancement (New feature or request), processors (Involves self-contained analytical processors)

Comments

@dale-wahl
Member

Converting NDJSON to CSV is not ideal: it would be better to maintain the structure of the original data, and any unmapped information is lost in the new CSV object. This is a consequence of using map_item() (which seems essential for working with JSON data that can be dynamic), yet it can be difficult and in some cases impossible to update/convert the original data fields after they have been mapped.
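
A minimal sketch of the lossiness, assuming a hypothetical datasource whose map_item() only picks out a few fields (the field names are made up for illustration):

```python
import json

# Hypothetical mapping: only "id", "body" and the author's name survive.
# Anything else in the original item (e.g. a nested "metadata" object) never
# reaches the CSV and cannot be recovered from it afterwards.
def map_item(item):
    return {
        "id": item.get("id", ""),
        "body": item.get("body", ""),
        "author": item.get("author", {}).get("name", ""),
    }

with open("dataset.ndjson") as infile:
    for line in infile:
        original = json.loads(line)
        mapped = map_item(original)
        # writing "mapped" to CSV discards every key of "original" that the
        # mapping above did not explicitly copy over
```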

It could be possible to add a reverse_map_item to datasources; however, that may not work in all cases and may even be impossible for datasources whose JSON structure changes significantly.
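
To make the limitation concrete, a purely hypothetical reverse_map_item() could look like the sketch below; it only works when the datasource knows exactly where each mapped column came from, which is why it cannot work in general:

```python
import copy

# Hypothetical reverse mapping: write edited flat-row values back into the
# original nested item. Field names are illustrative, not actual 4CAT API.
def reverse_map_item(original_item, mapped_row):
    updated = copy.deepcopy(original_item)
    if "body" in mapped_row:
        updated["body"] = mapped_row["body"]
    if "author" in mapped_row:
        # if the original nested an author object, a flat "author" column does
        # not say which sub-field it came from; this guess is the core problem
        updated.setdefault("author", {})["name"] = mapped_row["author"]
    return updated
```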

There may be alternative solutions for individual processors. Listed below are some processors identified as having this issue:

| processor | issue | possible solution |
| --- | --- | --- |
| accent_fold.py | allows all map_item fields to be converted; currently disabled for ndjson | could remove choice of field from user (and thus ignore map_item) and convert all text fields (see the sketch below) |
| expand_url_shorteners.py | reverse map "body" and "urls" | currently disabled for ndjson |
| lexical_filter.py | adds column that is not "visible" via frontend | fixed to run with ndjson, but visibility still a problem |
| write_annotations.py | adds columns that are not "visible" via frontend (also possible field name collisions #293) | fixed to run with ndjson, but visibility still a problem |
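
For the accent_fold.py row above, the "convert all text fields" idea could look roughly like this; the recursion is illustrative, not the actual processor code:

```python
import unicodedata

# Instead of asking the user for a (mapped) column, walk the raw NDJSON item
# and fold accents in every string it contains, leaving other values untouched.
def fold_accents(value):
    if isinstance(value, str):
        return "".join(c for c in unicodedata.normalize("NFKD", value)
                       if not unicodedata.combining(c))
    if isinstance(value, dict):
        return {key: fold_accents(item) for key, item in value.items()}
    if isinstance(value, list):
        return [fold_accents(item) for item in value]
    return value
```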

These conversion processors are currently listed as filters in order to take advantage of the standalone dataset feature, which allows processors that are limited to is_top_dataset() (or that have no self.key_parent) or to module.type.endswith("search") to be used. CSVs/datasources without map_item could work as filters, but processor.is_filter() (used by the frontend and possibly elsewhere to locate the standalone dataset) would need to differentiate between filetypes. Perhaps is_filter() should be pushed down to the dataset (e.g. something like is_standalone(), which doesn't exist yet).
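
A rough sketch of what an is_standalone() on the dataset could look like; the attribute and method names here are assumptions for illustration, not existing 4CAT API:

```python
class DataSet:
    def is_standalone(self):
        # "extension" and "get_own_processor" stand in for whatever the dataset
        # actually stores about its file type and originating datasource
        if self.extension == "csv":
            # flat data can always be filtered/written back as CSV
            return True
        if self.extension == "ndjson":
            # NDJSON only qualifies if its datasource can map items to flat rows
            mapper = getattr(self.get_own_processor(), "map_item", None)
            return callable(mapper)
        return False
```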

@stijn-uva added the enhancement (New feature or request) and processors (Involves self-contained analytical processors) labels on Dec 13, 2022
@dale-wahl
Member Author

The expand_url_shorteners.py processor was deprecated in favor of extract_url.py and consolidate_urls.py, both of which are listed as "conversion" processors and create a new sub-dataset (not a filter).
