Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSONL #179

Closed
tucnak opened this issue Nov 23, 2024 · 6 comments
Closed

JSONL #179

tucnak opened this issue Nov 23, 2024 · 6 comments
Labels
documentation Improvements or additions to documentation feature New feature or request priority-medium Medium priority issue user-request This issue was directly requested by a user

Comments

@tucnak
Copy link

tucnak commented Nov 23, 2024

What feature are you requesting?

In the LLM community, JSON Lines file format is widely-used for distributing fine-tuning datasets. For example, see fine-tuning tutorials by Together or OpenAI on how .jsonl files are consumed for reference. Not everybody uses the same format, obviously, but the industry largely seems to have converged on .jsonl for fine-tuning tasks, even though the primary datasets that go into it are usually in CSV and Parquet formats. Furthermore, producing JSONL files from a json(b) set in Postgres is nontrivial:

COPY (SELECT messages FROM dataset)
TO 'dataset.jsonl'
WITH (FORMAT CSV, DELIMITER E'\n', QUOTE E'\x01', ESCAPE E'\x01')

Why are you requesting this feature?

Here at the Foundation we're really happy to have pg_analytics as it enables our analysts without coding background to readily consume Huggingface datasets in Parquet format. Thankfully, pg_analytics accepts file lists, so we were able to create a helper function in PL/Python that is able to fetch Parquet part files from Huggingface before CREATE FOREIGN TABLE. This has enabled our SFT team massively; we have been able to use people with zero programming experience who probably couldn't do this job as easily under different circumstances.

However, we still rely on SWE expertise to carry SFT runs for most prototypes, and .jsonl handling is a major part. There's benefit in being able to read such files, as schemas change and for past SFT runs it's sometimes impossible to tell exactly what it's looked like unless you have the .jsonl file around. Not to mention, if we could use pg_analytics not only to read, but to write JSONL also, it would be huge; surely, for many small labs it would be similarly beneficial.

I know pg_analytics doesn't facilitate writes, but think of it as food for thought!

What is your proposed implementation for this feature?

  1. SELECT from JSONL similar to current JSON support with field to column mappings;
  2. INSERT entries to a .jsonl file on disk (it's my understanding that S3 supports append now...)

Full Name:

Ilya Kowalewski

Affiliation:

The Stone Cross Foundation

@tucnak tucnak added the feature New feature or request label Nov 23, 2024
@philippemnoel
Copy link
Collaborator

I believe this is already supported, but not documented, per here: #180 (which we use under the hood).

@tucnak
Copy link
Author

tucnak commented Nov 25, 2024

Thank you, indeed that's the case. I think I'd tried this earlier but it didn't work for some reason but having tried again there's no problem. Odd. However, when performing INSERT I'm seeing ERROR: option 'rowid_column' is required with Detail: Wrappers. I take it that writes aren't supposed to work, so I reckon DuckDB doesn't supports it? Their documentation only show-cases COPY either to, or from a file.

I imagine that inserts back to file otherwise therefore would be impossible from FDW side?

So for all intents and purposes native COPY in Postgres with CSV delimiters is the only way to go?

@philippemnoel
Copy link
Collaborator

We only support reads for now, but plan to support writes eventually

@philippemnoel
Copy link
Collaborator

@philippemnoel philippemnoel added documentation Improvements or additions to documentation priority-medium Medium priority issue user-request This issue was directly requested by a user labels Nov 25, 2024
@philippemnoel philippemnoel reopened this Nov 26, 2024
@philippemnoel
Copy link
Collaborator

I'll write explicit documentation for JSONL

@philippemnoel
Copy link
Collaborator

Done! Pushed in 3b0b309

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation feature New feature or request priority-medium Medium priority issue user-request This issue was directly requested by a user
Projects
None yet
Development

No branches or pull requests

2 participants