JSONL #179
Comments
I believe this is already supported, but not documented, per here: #180 (which we use under the hood).
Thank you, indeed that's the case. I think I'd tried this earlier and it didn't work for some reason, but having tried again there's no problem. Odd. However, when performing inserts, I imagine that writing back to the file would be impossible from the FDW side? So for all intents and purposes, native support is read-only?
We only support reads for now, but plan to support writes eventually.
I'll write explicit documentation for JSONL.
Done! Pushed in 3b0b309.
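For context on the read path being discussed: a JSONL read is conceptually just line-wise JSON parsing, one document per line becoming one row. A minimal Python sketch (illustrative only, not pg_analytics internals; the field names are made up):

```python
import io
import json

# A .jsonl "file": one JSON object per line (field names are illustrative).
raw = io.StringIO(
    '{"prompt": "Hi", "completion": "Hello!"}\n'
    '{"prompt": "2+2?", "completion": "4"}\n'
)

# Conceptual read path: each non-empty line becomes one row (a dict).
rows = [json.loads(line) for line in raw if line.strip()]

print(rows[0]["completion"])  # Hello!
```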
What feature are you requesting?
In the LLM community, the JSON Lines file format is widely used for distributing fine-tuning datasets. For example, see the fine-tuning tutorials by Together or OpenAI for how `.jsonl` files are consumed. Not everybody uses the same format, obviously, but the industry largely seems to have converged on `.jsonl` for fine-tuning tasks, even though the primary datasets that go into it are usually in CSV and Parquet formats. Furthermore, producing JSONL files from a json(b) set in Postgres is nontrivial.

Why are you requesting this feature?
Here at the Foundation we're really happy to have pg_analytics, as it enables our analysts without a coding background to readily consume Hugging Face datasets in Parquet format. Thankfully, pg_analytics accepts file lists, so we were able to create a helper function in PL/Python that fetches Parquet part files from Hugging Face before `CREATE FOREIGN TABLE`. This has helped our SFT team massively; we have been able to use people with zero programming experience who probably couldn't do this job as easily under different circumstances.

However, we still rely on SWE expertise to carry out SFT runs for most prototypes, and `.jsonl` handling is a major part of that. There's benefit in being able to read such files: schemas change, and for past SFT runs it's sometimes impossible to tell exactly what the data looked like unless you have the `.jsonl` file around. Not to mention, if we could use pg_analytics not only to read but also to write JSONL, it would be huge; surely, for many small labs it would be similarly beneficial. I know pg_analytics doesn't facilitate writes, but think of it as food for thought!
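The PL/Python helper described above could look roughly like the following sketch: given a dataset repo and its Parquet part files, it builds the direct `resolve` download URLs that a pg_analytics file list could then point at. The repo and file names here are made up for illustration; only the Hugging Face URL pattern itself is real.

```python
# Hedged sketch of a Hugging Face URL helper (repo/file names are
# hypothetical; the https://huggingface.co/datasets/.../resolve/...
# URL pattern is the documented direct-download form).
def hf_parquet_urls(repo_id, filenames, revision="main"):
    base = "https://huggingface.co/datasets/{repo}/resolve/{rev}/{name}"
    return [
        base.format(repo=repo_id, rev=revision, name=name)
        for name in filenames
    ]

urls = hf_parquet_urls(
    "example-org/example-sft",          # hypothetical dataset repo
    ["data/part-00000.parquet"],        # hypothetical part file
)
print(urls[0])
```

In the actual setup this would run as a PL/Python function, with the resulting list fed into the foreign table's file options.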
What is your proposed implementation for this feature?
- `SELECT` from JSONL, similar to current JSON support, with field-to-column mappings;
- `INSERT` entries to a `.jsonl` file on disk (it's my understanding that S3 supports append now...).

Full Name:
Ilya Kowalewski
Affiliation:
The Stone Cross Foundation
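The two proposed operations can be sketched outside Postgres as plain Python, to make the semantics concrete: (1) `SELECT` maps named JSON fields to columns, one tuple per line; (2) `INSERT` serializes a row and appends it as one new line. Field names and data are hypothetical, and a list stands in for the on-disk file.

```python
import json

# Stand-in for a .jsonl file on disk (field names are hypothetical).
jsonl_lines = [
    '{"prompt": "Hi", "completion": "Hello!", "meta": {"run": 1}}',
    '{"prompt": "2+2?", "completion": "4", "meta": {"run": 1}}',
]

columns = ("prompt", "completion")

# (1) SELECT with field-to-column mapping: one tuple per line;
#     missing fields would surface as None (i.e. SQL NULL).
rows = [
    tuple(json.loads(line).get(col) for col in columns)
    for line in jsonl_lines
]

# (2) INSERT: serialize the new row and append it as one line.
new_row = {"prompt": "Bye", "completion": "Goodbye!"}
jsonl_lines.append(json.dumps(new_row))

print(rows[1])           # ('2+2?', '4')
print(len(jsonl_lines))  # 3
```

The append-only nature of step (2) is what makes the S3-append observation relevant: no rewrite of earlier lines is ever needed.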