TSV is a nice format. It's human-readable, very fast to parse, and concise. It can also be processed as a streaming format.
It's also painfully hard to work with, due to the lack of standardization and typing. Each file requires a different set of heuristics to interpret, and those heuristics might break 80GB into a 100GB file when an oddball case comes up. The heuristics also translate into a tremendous amount of unnecessary ETL work. Even importing a TSV/CSV into a spreadsheet pops up a dialog asking the user how to interpret it.
This is a proposal for an enhanced TSV format. If this proposal is successful, a spreadsheet would be able to open a MySQL export without prompting the user for help, and most data would flow freely between tools like pandas, Tableau, R, and others (excepting issues of formatting and precision).
tsvx files are a tab-separated format, where:
- Columns are statically typed, with declared types, standardized string escaping, standardized date formats, and the like for other data types.
- We standardize column headers which provide metadata about column names, types, and contents.
- We standardize file-level metadata for overall file description, provenance, creation time, authors, and so on.
This repository includes a simple parser for such files, so we can experiment with the concept. It also includes scripts to export from MySQL, both at the table level and at the whole-database level.
Here is a simple example of a tsvx file:

```
title: Food inventory
created-date: 2016-10-29T15:25:29.449640
---------------------
Food Name    Weight    Price    Expiration Date
foodname     weight    netprice exp              (variables)
str          int       float    ISO8601-date     (types)
String       Number    Number   String           (json)
---------------------
Tuna         300       5.13     2017-10-12
Salmon       150       7.18     2018-10-12
Swordfish    250       9.41     2016-11-13
...
```
We have nice standards for documents (XML, as well as other markup languages). We have nice standards for objects and configuration data (YAML and JSON). While numerical and tabular data is becoming increasingly important, we don't have any nice standards for representing it; it is all very ad-hoc.
Such data is also increasingly interchanged across tools. We see organizations with terabytes of data in systems like Hadoop and Vertica, gigabytes in databases and flat files, and kilobytes of hand-curated data in spreadsheets, all integrated into common systems.
Our design goals are:
- Human-readability
- Compatibility. Excel, Google Docs, Hadoop, LibreOffice, MySQL, PostgreSQL, Python, R, Vertica, and other tools should be able to open each other's exports, maintaining basic type information, without additional scripting.
- Simple parsing. Processing scripts are simpler and less brittle. With tsvx, adding a column or changing column order doesn't break scripts.
- In-Line Documentation. We have many TSV files sitting around, often years old, and no idea of what they contain, or how to regenerate them on current data.
- Backwards-compatibility. A tsvx file should open reasonably in spreadsheets unaware of tsvx.
- Fast, single-pass processing. tsvx files are usable up to perhaps 10GB of data.
- Extensibility. If it becomes popular, it should be able to replace formats from tools like mysqldump with something less ad-hoc.
We are in a request-for-comments period. If we decide this is a good idea, we'll make this more production-ready, and perhaps draft an IETF RFC or similar standard.
The file has three sections:
- File metadata (optional) -- Information about the file itself.
- Column headings (required) -- Metadata about the names and types in the columns.
- Data (required) -- What one might traditionally find in a TSV file.
Sections are separated by a line of all dashes, containing at least three dashes. If the first section is omitted, the file should start with a line of dashes.
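As a minimal sketch (not the reference library's API; `split_sections` is a hypothetical helper), a parser might split a file into its sections like this:

```python
import re

SEPARATOR = re.compile(r'^-{3,}\s*$')  # a line of three or more dashes

def split_sections(lines):
    """Split a tsvx stream into (metadata, headers, data) lists of lines.

    If the file starts with a dash line, the metadata list comes back
    empty, matching the rule that the first section is optional.
    """
    sections = [[]]
    for line in lines:
        if SEPARATOR.match(line):
            sections.append([])
        else:
            sections[-1].append(line.rstrip('\n'))
    if len(sections) != 3:
        raise ValueError("expected exactly two dash separator lines")
    return sections

# Collecting the data section into a list is fine for a sketch; a real
# reader would keep iterating lazily to preserve single-pass streaming.
with open("file.tsvx") as f:
    metadata_lines, header_lines, data_lines = split_sections(f)
```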
The metadata is a YAML dictionary. The first line must contain a colon, but we recommend having all lines in this `key: value` format. Standard fields include the following (a parsing sketch follows the list):
- `authors` -- JSON list of authors
- `created-date` -- ISO8601 date and time of when the file was created
- `description` -- A multiline description
- `generator` -- Some identifier of the program or source which created the file
- `modified-date` -- ISO8601 date and time of last modification
- `title` -- A single-line title/description
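Since the section is plain YAML, it can be handed to a YAML parser more or less directly. A sketch using PyYAML (an assumed dependency for illustration, not necessarily what the reference library does):

```python
import yaml  # PyYAML, assumed here for illustration

def parse_metadata(metadata_lines):
    """Parse the (possibly empty) metadata section as a YAML dictionary."""
    metadata = yaml.safe_load("\n".join(metadata_lines)) or {}
    if not isinstance(metadata, dict):
        raise ValueError("tsvx metadata must be a YAML dictionary")
    return metadata

meta = parse_metadata([
    "title: Food inventory",
    "created-date: 2016-10-29T15:25:29.449640",
])
print(meta["title"])  # Food inventory
```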
The first line of the headers gives human-readable column headers (required). Following lines define additional information about each column. The `types` line is required; it says how the data ought to be interpreted. Currently defined types are:

- `ISO8601-date` -- Date, as `2014-12-30`
- `ISO8601-datetime` -- Date and time, as `2014-12-30T11:59:00.01`
- `bool` -- Boolean: `true` or `false`
- `float` -- Floating point number, with or without a decimal point
- `int` -- Integer
- `str` -- JSON-encoded string (quotes omitted)
As a fallback, we strongly recommend a `json` line. All columns should be defined as one of three JSON types:

- `Boolean` -- `true` or `false`
- `Number` -- Integer or floating point
- `String` -- Most other data types fall into this category
If the `json` line is missing, parsers should treat unrecognized types as strings.
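Put together, these rules amount to a small dispatch table. Here is a sketch of how a parser might map declared types to Python converters (the names are hypothetical; note that `fromisoformat` only accepts arbitrary fractional seconds on Python 3.11+):

```python
import datetime
import json

# Map declared tsvx types to converter functions.
TYPE_PARSERS = {
    "str": lambda s: json.loads('"%s"' % s),  # naive: assumes no unescaped quotes
    "int": int,
    "float": float,
    "bool": lambda s: {"true": True, "false": False}[s],
    "ISO8601-date": datetime.date.fromisoformat,
    "ISO8601-datetime": datetime.datetime.fromisoformat,
}

def converter_for(declared_type):
    # Unrecognized types fall back to plain strings, per the rule above.
    return TYPE_PARSERS.get(declared_type, str)

types = ["str", "int", "float", "ISO8601-date"]
cells = ["Tuna", "300", "5.13", "2017-10-12"]
row = [converter_for(t)(c) for t, c in zip(types, cells)]
# ['Tuna', 300, 5.13, datetime.date(2017, 10, 12)]
```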
A `variables` line is also strongly recommended. It gives useful variable names to assign in a program. These names should consist of letters, numbers, and underscores, and may not begin with a number. This is convenient for automatic parsers and parser generators.
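That naming rule is easy to check mechanically, for example with a hypothetical validation helper like this:

```python
import re

# Letters, digits, and underscores; must not begin with a digit.
# (Whether a leading underscore is allowed is an assumption here.)
VARIABLE_NAME = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')

def check_variables(names):
    bad = [n for n in names if not VARIABLE_NAME.match(n)]
    if bad:
        raise ValueError("invalid variable names: " + ", ".join(bad))

check_variables(["foodname", "weight", "netprice", "exp"])  # passes
```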
In addition, there may be headers for other metadata, such as units, or vendor extensions.
Vendors may add arbitrary extensions to both metadata and headers. The keys should begin with a program name and a dash. For example, `mysql` could add rows called `mysql-types` and `mysql-keys` to the headers, which would allow imports/exports to maintain both type information and which columns are unique. `mysql` could also place global column metadata (such as multi-column constraints or storage engine) in the file-wide metadata with keys such as `mysql-constraints` and `mysql-engine`, or just in its own YAML section.
Such extensions should pay close attention to human readability.
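Since the header rows are self-labeling (each one ends with a parenthesized name, as in the examples), a parser can carry unknown vendor rows along without understanding them. A sketch, assuming that convention:

```python
def parse_header_section(header_lines):
    """Split the header section into column names plus labeled rows.

    The first line holds the human-readable column headers; each
    following line ends with a "(label)" cell, e.g. "(types)" or a
    vendor row like "(mysql-keys)". Vendor rows are kept verbatim so
    they survive a round trip through tools that ignore them.
    """
    column_names = header_lines[0].split("\t")
    labeled_rows = {}
    for line in header_lines[1:]:
        cells = line.split("\t")
        label = cells[-1].strip("()")
        labeled_rows[label] = cells[:-1]
    return column_names, labeled_rows

names, rows = parse_header_section([
    "Food Name\tWeight",
    "foodname\tweight\t(variables)",
    "VARCHAR(80)\tSMALLINT\t(mysql-types)",
])
# rows["mysql-types"] == ["VARCHAR(80)", "SMALLINT"]
```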
This is a before-and-after example of the same code: first with a tsvx file and this library, then with a plain TSV file parsed by hand in Python:
After:

```python
for line in tsvx.reader(open("file.tsvx")):
    do_some_stuff(line.foodname,
                  line.weight,
                  line.price,
                  line.expiration)
```
Before:

```python
import dateutil.parser

f = open("file.tsv")
# Skip headers
next(f)
# Break on tabs
split_lines = (l[:-1].split('\t') for l in f)
# Parse to types
parsed_lines = (
    [l[0], int(l[1]), float(l[2]), dateutil.parser.parse(l[3])]
    for l in split_lines
)
for foodname, weight, price, expiration in parsed_lines:
    do_some_stuff(foodname, weight, price, expiration)
```
There is also a small file in the repo, `example/example.py`, which shows how the prototype reference library works. Note that both the API and the file format are still in flux -- we are actively soliciting feedback.
This library is not in any sort of stable shape. Indeed, it exists mainly to explore what the APIs ought to be and how the format ought to be structured.
Please feel free to comment, criticise, make pull requests, etc.
We also make no claims about performance. The goal of this iteration is to be programmer-friendly; we may get to performance later.
The library is structured for use as an internal data exchange format. We'd want to provide a "safe" mode before accepting TSVx files from untrusted external sources. Likely, this just consists of getting rid of access through `line.attribute` and only allowing `line['attribute']`.
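To make the trade-off concrete, here is a sketch of a row object supporting both styles (hypothetical; not the library's actual class):

```python
class Line:
    """A parsed data row; field names come from the variables header."""

    def __init__(self, variables, values):
        self._fields = dict(zip(variables, values))

    def __getitem__(self, key):          # the "safe" access path
        return self._fields[key]

    def __getattr__(self, key):          # convenient, but collides with
        try:                             # real attributes and methods --
            return self._fields[key]     # hence the concern about
        except KeyError:                 # untrusted files
            raise AttributeError(key)

line = Line(["foodname", "weight"], ["Tuna", 300])
assert line["foodname"] == line.foodname == "Tuna"
```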
We just stepped up to Python 3! Older commits support Python 2. We haven't tested the newer version with Python 2 (it probably doesn't work).
And a more complex example, to show how extensions fit in:
```
title: Food inventory
description: A sample tsvx file
created-date: 2016-10-29T15:25:29.449640
generator: myoffice.py
myoffice-version: 2.7
---------------------
Food Name    Weight     Price       Expiration Date
foodname     weight     price       expiration       (variables)
str          int        float       ISO8601-date     (types)
             kg         dollars/kg                   (units)
String       Number     Number      String           (json)
VARCHAR(80)  SMALLINT   DOUBLE      VARCHAR(20)      (mysql-types)
primary                                              (mysql-keys)
inventory    shipweight gross       exp              (myoffice-schema)
%s           %i         %.2f        YYYY-MM-DD       (myoffice-format-strings)
---------------------
Tuna         300        5.13        2017-10-12
Salmon       150        7.18        2018-10-24
Swordfish    250        9.41        2016-11-13
```