Converting NDJSON to CSV is not ideal: it would be better to maintain the structure of the original data, and any unmapped information is lost in the new CSV object. This result is a product of using `map_item()` (which seems essential for working with JSON data that can be dynamic). However, it can be difficult, and in some cases impossible, to update/convert the original data fields after they have been mapped.
It could be possible to have a `reverse_map_item` on datasources; however, that may not work in all cases and may even be impossible for datasources whose JSON structure changes significantly.
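To illustrate the idea, here is a minimal sketch of what a `reverse_map_item` might look like: a per-datasource "reverse map" from flat mapped fields back to paths in the nested original item. The `REVERSE_MAP` paths and function signature are assumptions for illustration, not an existing 4CAT API; note how the approach depends on the declared paths staying valid, which is exactly what breaks when a datasource's JSON structure changes.

```python
import json

# Hypothetical per-datasource reverse map: mapped field -> path in the
# original nested item. These paths are assumed for illustration only.
REVERSE_MAP = {
    "body": ["data", "text"],
    "author": ["data", "user", "name"],
}

def reverse_map_item(mapped_item, original_item, reverse_map=REVERSE_MAP):
    """Write (possibly updated) mapped fields back into a copy of the
    original nested NDJSON item, leaving unmapped data untouched."""
    item = json.loads(json.dumps(original_item))  # cheap deep copy
    for field, path in reverse_map.items():
        if field not in mapped_item:
            continue  # field was never mapped; nothing to write back
        target = item
        for key in path[:-1]:
            target = target.setdefault(key, {})
        target[path[-1]] = mapped_item[field]
    return item
```

A processor could then edit the mapped view and write the result back without flattening the item to CSV, at the cost of every datasource maintaining an accurate reverse map.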
There may be alternate solutions for individual processors. Listed below are some processors identified as having this issue:
| processor | issue | possible solution |
| --- | --- | --- |
| `accent_fold.py` | allows all `map_item` fields to be converted; currently disabled for NDJSON | could remove choice of field from user (and thus ignore `map_item`) and convert all text fields |
| `expand_url_shorteners.py` | reverse map "body" and "urls" | currently disabled for NDJSON |
| `lexical_filter.py` | adds column that is not "visible" via frontend | fixed to run with NDJSON, but visibility still a problem |
| `write_annotations.py` | adds columns that are not "visible" via frontend (also possible field name collisions #293) | fixed to run with NDJSON, but visibility still a problem |
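The `accent_fold.py` suggestion above (ignore `map_item` and convert all text fields) could be sketched as a recursive walk over the raw NDJSON item that folds every string value in place, so no field choice from the user is needed. This is a sketch under that assumption, not the current processor's implementation:

```python
import unicodedata

def fold_accents(text):
    """Strip combining accent marks from a string, e.g. 'café' -> 'cafe'."""
    return "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(c)
    )

def fold_all_text_fields(item):
    """Recursively accent-fold every string value in a (nested) NDJSON item,
    leaving non-string values and the overall structure unchanged."""
    if isinstance(item, str):
        return fold_accents(item)
    if isinstance(item, dict):
        return {key: fold_all_text_fields(value) for key, value in item.items()}
    if isinstance(item, list):
        return [fold_all_text_fields(value) for value in item]
    return item  # numbers, booleans, None pass through unchanged
```

Because this operates on the original item rather than the mapped view, the output stays valid NDJSON with no information loss.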
These conversion processors are currently listed as filters in order to take advantage of the standalone dataset feature, which allows processors limited to `is_top_dataset()` (or having no `self.key_parent`) or `module.type.endswith("search")` to be used. CSVs/datasources without `map_item` could work as filters, but `processor.is_filter()` (used by the frontend, and maybe elsewhere, to locate standalone datasets) would need to differentiate between filetypes. Perhaps `is_filter()` should be pushed to the dataset (e.g. something like `is_standalone()`, which doesn't exist yet).
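Pushing the check down to the dataset might look something like the following. Everything here besides `key_parent` and `module.type` is an assumption; `is_standalone()` does not exist yet, and the filetype condition is just one way the differentiation could be expressed:

```python
class DataSet:
    """Minimal stand-in for a 4CAT dataset, for illustration only."""

    def __init__(self, extension, key_parent=None, module_type=""):
        self.extension = extension      # e.g. "csv" or "ndjson"
        self.key_parent = key_parent    # parent dataset key, if any
        self.module_type = module_type  # type of the module that produced it

    def is_top_dataset(self):
        # a dataset with no parent key is a top-level dataset
        return not self.key_parent

    def is_standalone(self):
        # hypothetical replacement for processor.is_filter(): combine the
        # existing eligibility conditions...
        eligible = self.is_top_dataset() or self.module_type.endswith("search")
        # ...and differentiate by filetype here, e.g. only CSVs (which need
        # no map_item) qualify as standalone/filterable datasets
        return eligible and self.extension == "csv"
```

The frontend could then ask the dataset itself whether it can host a filter, instead of inferring it from the processor.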
The `expand_url_shorteners.py` processor was deprecated in favor of `extract_url.py` and `consolidate_urls.py`, both of which are listed as "conversion" processors and create a new sub-dataset (not a filter).