JSONL #179

tucnak · 2024-11-23T16:37:41Z

What feature are you requesting?

In the LLM community, JSON Lines file format is widely-used for distributing fine-tuning datasets. For example, see fine-tuning tutorials by Together or OpenAI on how .jsonl files are consumed for reference. Not everybody uses the same format, obviously, but the industry largely seems to have converged on .jsonl for fine-tuning tasks, even though the primary datasets that go into it are usually in CSV and Parquet formats. Furthermore, producing JSONL files from a json(b) set in Postgres is nontrivial:

COPY (SELECT messages FROM dataset)
TO 'dataset.jsonl'
WITH (FORMAT CSV, DELIMITER E'\n', QUOTE E'\x01', ESCAPE E'\x01')

Why are you requesting this feature?

Here at the Foundation we're really happy to have pg_analytics as it enables our analysts without coding background to readily consume Huggingface datasets in Parquet format. Thankfully, pg_analytics accepts file lists, so we were able to create a helper function in PL/Python that is able to fetch Parquet part files from Huggingface before CREATE FOREIGN TABLE. This has enabled our SFT team massively; we have been able to use people with zero programming experience who probably couldn't do this job as easily under different circumstances.

However, we still rely on SWE expertise to carry SFT runs for most prototypes, and .jsonl handling is a major part. There's benefit in being able to read such files, as schemas change and for past SFT runs it's sometimes impossible to tell exactly what it's looked like unless you have the .jsonl file around. Not to mention, if we could use pg_analytics not only to read, but to write JSONL also, it would be huge; surely, for many small labs it would be similarly beneficial.

I know pg_analytics doesn't facilitate writes, but think of it as food for thought!

What is your proposed implementation for this feature?

SELECT from JSONL similar to current JSON support with field to column mappings;
INSERT entries to a .jsonl file on disk (it's my understanding that S3 supports append now...)

Full Name:

Ilya Kowalewski

Affiliation:

The Stone Cross Foundation

The text was updated successfully, but these errors were encountered:

philippemnoel · 2024-11-25T19:55:59Z

I believe this is already supported, but not documented, per here: #180 (which we use under the hood).

tucnak · 2024-11-25T20:26:12Z

Thank you, indeed that's the case. I think I'd tried this earlier but it didn't work for some reason but having tried again there's no problem. Odd. However, when performing INSERT I'm seeing ERROR: option 'rowid_column' is required with Detail: Wrappers. I take it that writes aren't supposed to work, so I reckon DuckDB doesn't supports it? Their documentation only show-cases COPY either to, or from a file.

I imagine that inserts back to file otherwise therefore would be impossible from FDW side?

So for all intents and purposes native COPY in Postgres with CSV delimiters is the only way to go?

philippemnoel · 2024-11-25T21:09:01Z

We only support reads for now, but plan to support writes eventually

philippemnoel · 2024-11-25T21:09:10Z

Docs here: https://github.com/paradedb/pg_analytics/blob/dev/docs/object_stores/huggingface.mdx

philippemnoel · 2024-11-26T15:52:46Z

I'll write explicit documentation for JSONL

philippemnoel · 2024-12-16T17:06:31Z

Done! Pushed in 3b0b309

tucnak added the feature New feature or request label Nov 23, 2024

tucnak mentioned this issue Nov 23, 2024

Overriding HTTP headers from DDL and environment #180

Closed

philippemnoel added documentation Improvements or additions to documentation priority-medium Medium priority issue user-request This issue was directly requested by a user labels Nov 25, 2024

philippemnoel closed this as completed Nov 26, 2024

philippemnoel reopened this Nov 26, 2024

philippemnoel closed this as completed Dec 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSONL #179

JSONL #179

tucnak commented Nov 23, 2024

philippemnoel commented Nov 25, 2024

tucnak commented Nov 25, 2024

philippemnoel commented Nov 25, 2024

philippemnoel commented Nov 25, 2024

philippemnoel commented Nov 26, 2024

philippemnoel commented Dec 16, 2024

JSONL #179

JSONL #179

Comments

tucnak commented Nov 23, 2024

What feature are you requesting?

Why are you requesting this feature?

What is your proposed implementation for this feature?

Full Name:

Affiliation:

philippemnoel commented Nov 25, 2024

tucnak commented Nov 25, 2024

philippemnoel commented Nov 25, 2024

philippemnoel commented Nov 25, 2024

philippemnoel commented Nov 26, 2024

philippemnoel commented Dec 16, 2024