
Conversion processors convert NDJSON filetypes to CSV #295

Open
dale-wahl opened this issue Oct 5, 2022 · 1 comment
Labels: enhancement (New feature or request), processors (Involves self-contained analytical processors)

Comments

@dale-wahl
Member

Converting NDJSON to CSV is not ideal: it would be better to maintain the structure of the original data, and any unmapped information is lost in the new CSV object. This is a consequence of using map_item() (which seems essential for working with JSON data that can be dynamic), yet it can be difficult and in some cases impossible to update/convert the original data fields after they have been mapped.
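
A minimal sketch of the lossiness, assuming a hypothetical datasource whose map_item() only picks out a few fields (the field names are made up for illustration):

```python
import json

# Hypothetical mapping: only "id", "body" and the author's name survive.
# Anything else in the original item (e.g. a nested "metadata" object) never
# reaches the CSV and cannot be recovered from it afterwards.
def map_item(item):
    return {
        "id": item.get("id", ""),
        "body": item.get("body", ""),
        "author": item.get("author", {}).get("name", ""),
    }

with open("dataset.ndjson") as infile:
    for line in infile:
        original = json.loads(line)
        mapped = map_item(original)
        # writing "mapped" to CSV discards every key of "original" that the
        # mapping above did not explicitly copy over
```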

It could be possible to add a reverse_map_item to datasources; however, that may not work in all cases and may even be impossible for datasources whose JSON structure changes significantly.
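
To make the limitation concrete, a purely hypothetical reverse_map_item() could look like the sketch below; it only works when the datasource knows exactly where each mapped column came from, which is why it cannot work in general:

```python
import copy

# Hypothetical reverse mapping: write edited flat-row values back into the
# original nested item. Field names are illustrative, not actual 4CAT API.
def reverse_map_item(original_item, mapped_row):
    updated = copy.deepcopy(original_item)
    if "body" in mapped_row:
        updated["body"] = mapped_row["body"]
    if "author" in mapped_row:
        # if the original nested an author object, a flat "author" column does
        # not say which sub-field it came from; this guess is the core problem
        updated.setdefault("author", {})["name"] = mapped_row["author"]
    return updated
```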

There may be alternative solutions for individual processors. Listed below are some processors identified as having this issue:

| processor | issue | possible solution |
| --- | --- | --- |
| accent_fold.py | allows all map_item fields to be converted; currently disabled for ndjson | could remove choice of field from user (and thus ignore map_item) and convert all text fields (see the sketch below) |
| expand_url_shorteners.py | reverse map "body" and "urls" | currently disabled for ndjson |
| lexical_filter.py | adds column that is not "visible" via frontend | fixed to run with ndjson, but visibility still a problem |
| write_annotations.py | adds columns that are not "visible" via frontend (also possible field name collisions #293) | fixed to run with ndjson, but visibility still a problem |
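
For the accent_fold.py row above, the "convert all text fields" idea could look roughly like this; the recursion is illustrative, not the actual processor code:

```python
import unicodedata

# Instead of asking the user for a (mapped) column, walk the raw NDJSON item
# and fold accents in every string it contains, leaving other values untouched.
def fold_accents(value):
    if isinstance(value, str):
        return "".join(c for c in unicodedata.normalize("NFKD", value)
                       if not unicodedata.combining(c))
    if isinstance(value, dict):
        return {key: fold_accents(item) for key, item in value.items()}
    if isinstance(value, list):
        return [fold_accents(item) for item in value]
    return value
```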

These conversion processors are currently listed as filters in order to take advantage of the standalone dataset feature, which allows processors that are limited to is_top_dataset() (or that have no self.key_parent) or to module.type.endswith("search") to be used. CSVs/datasources without map_item could work as filters, but processor.is_filter() (used by the frontend and possibly elsewhere to locate the standalone dataset) would need to differentiate between filetypes. Perhaps is_filter() should be pushed down to the dataset (e.g. something like is_standalone(), which doesn't exist yet).
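
A rough sketch of what an is_standalone() on the dataset could look like; the attribute and method names here are assumptions for illustration, not existing 4CAT API:

```python
class DataSet:
    def is_standalone(self):
        # "extension" and "get_own_processor" stand in for whatever the dataset
        # actually stores about its file type and originating datasource
        if self.extension == "csv":
            # flat data can always be filtered/written back as CSV
            return True
        if self.extension == "ndjson":
            # NDJSON only qualifies if its datasource can map items to flat rows
            mapper = getattr(self.get_own_processor(), "map_item", None)
            return callable(mapper)
        return False
```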

@stijn-uva added the enhancement (New feature or request) and processors (Involves self-contained analytical processors) labels on Dec 13, 2022
@dale-wahl
Member Author

The expand_url_shorteners.py processor was deprecated in favor of extract_url.py and consolidate_urls.py, both of which are listed as "conversion" processors and create a new sub-dataset (not a filter).
