CSV dialect detection: implementation without third party libraries #2247

ws-garcia · 2024-10-25T19:56:17Z

Discussed in #2246

Originally posted by ws-garcia October 25, 2024

Problem overview

Currently, this project has no stable way to detect the configuration (dialect) of a CSV file. An example is raised in #1719, where the utility fails to detect the configuration of the given files.

Details

@jqnatividad has already begun digging into the problem, suggesting:

Perhaps, we can tag-team on qsv-sniffer to make its CSV schema inferencing more reliable?

He also pointed out:

Aligning qsv-sniffer's behavior with python's csv sniffer is the way to go!

The work plan so far is outlined in jqnatividad/qsv-sniffer#14. Currently, all tasks are under study, and none has been completed.

New path

Here I will discuss a new approach to implementing dialect detection in qsv using simple elements:

Regexes: determine fields data types.
Current implemented parser: load data.
Table Uniformity measure: detect the table with the best structure.
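As an illustration of the first element, here is a minimal Python sketch of regex-based field typing. The pattern set, the type names, and the `detect_type` helper are assumptions made for this example; they are not taken from qsv or from the referenced implementation, which may recognize more (or different) types.

```python
import re

# Hypothetical pattern set for illustration only; a real detector would
# likely cover more types (times, currencies, URLs, and so on).
TYPE_PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("float", re.compile(r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?")),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("boolean", re.compile(r"(?i:true|false)")),
]


def detect_type(field: str) -> str:
    """Return the name of the first pattern that matches the whole field."""
    value = field.strip()
    if not value:
        return "empty"
    for name, pattern in TYPE_PATTERNS:
        if pattern.fullmatch(value):
            return name
    return "text"


print([detect_type(f) for f in ["42", "3.14", "2024-10-25", "a;b"]])
```

The idea is that a field split with the wrong delimiter (such as `a;b` above) tends to fall back to the generic text type, so a table parsed with the right dialect should show more cleanly typed fields.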

Combining these three elements makes dialect detection as reliable as CleverCSV's, and able to obtain results with greater certainty. The process is as follows:

In the first phase, potential dialects are built from combinations of field/column separator, quotation mark, and record delimiter characters. At this stage the user can provide a custom delimiter list, which gives the tool a degree of flexibility.
With each potential dialect, we attempt to parse the CSV file and use the data to construct a temporary table.
Each table is scored using the Table Uniformity measure. Each score is saved in a collection, using the dialect as the key.
The dialect that produces the table with the highest score is then selected as the detected one.
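The four steps above can be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, record delimiters are left to the `csv` module's own newline handling, and `placeholder_score` is a deliberately simple stand-in (row-width consistency times column count), not the Table Uniformity measure, whose formula is not given in this issue.

```python
import csv
import io
from itertools import product


def parse(text, delimiter, quotechar):
    """Step 2: parse the sample with one candidate dialect into a temporary table."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=quotechar)
    return [row for row in reader if row]


def placeholder_score(table):
    """Step 3 stand-in. NOT the Table Uniformity measure: it only rewards
    tables whose rows share the same width and have more than one column."""
    if not table:
        return 0.0
    widths = [len(row) for row in table]
    modal = max(set(widths), key=widths.count)  # most common row width
    if modal < 2:
        return 0.0  # a single column suggests the delimiter never matched
    return modal * widths.count(modal) / len(widths)


def detect_dialect(text, delimiters=(",", ";", "\t", "|"), quotechars=('"', "'")):
    """Steps 1 and 4: try every candidate dialect and keep the best-scoring one."""
    scores = {}
    for delimiter, quotechar in product(delimiters, quotechars):
        try:
            table = parse(text, delimiter, quotechar)
        except csv.Error:
            continue  # this candidate cannot parse the sample at all
        scores[(delimiter, quotechar)] = placeholder_score(table)
    if not scores:
        raise ValueError("no candidate dialect could parse the sample")
    return max(scores, key=scores.get), scores


sample = 'name;amount;note\n"Smith; J";1,5;ok\n"Lee; K";2,25;late\n'
best, scores = detect_dialect(sample)
print(best, scores[best])
```

The `delimiters` argument corresponds to the custom delimiter list mentioned in the first step. Swapping `placeholder_score` for a real uniformity measure would leave the rest of the loop unchanged.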

A Python implementation of exactly this four-step process is described in a GitHub repository. The evaluation of this method gives:

Tool	F1 score
`CSVsniffer`	0.9260
`CleverCSV`	0.8425
`csv.Sniffer`	0.8049

This sheds light on one point: the presented approach clearly outperforms both csv.Sniffer and CleverCSV on the research datasets.

Hoping this can help this wonderful project!

Edit:

Code snippet will be presented in the discussion.


jqnatividad · 2024-10-25T20:03:40Z

Thanks @ws-garcia !

This is very timely as I was dreading taking on the csv-sniffer python port, thus the lack of activity.

Your step-by-step "new path" breakdown is certainly easier to digest than the paper :)

Will be sure to loop you in as we mark progress...

ws-garcia · 2024-10-25T20:07:27Z

You can use the paper just to implement some of the logic if porting the Python code gets confusing. So treat the research as a backup reference when diving into the implementation.

jqnatividad added the enhancement label Oct 25, 2024

jqnatividad self-assigned this Oct 25, 2024

jqnatividad added the WIP work in progress label Oct 30, 2024
