Support upload of multi-file archives #45

Open

jeffhhk opened this issue Aug 2, 2021 · 5 comments

Labels: September 2021 Relay (bug fix / feature addition)

Comments

@jeffhhk (Contributor) commented Aug 2, 2021

@kbruskiewicz Again, great question on #35. I have posted my narrow answer to the question there as best I could.

In writing that response, I realized that my mental model of how KGE ought to work probably ought to be shared. Among other things, sharing my mental model lets me give a better answer to the question of what gets a sha1sum.

I know the rest of this response is more than you asked about. I promise to get to the specifics of what gets a sha1sum. Thanks in advance for bearing with me...

Let me step back and say that, to my recollection, no team has communicated with the Unsecret Agent team by exchanging separate files for nodes and edges. We have only dealt in the currency of multi-file archives such as .tgz or .zip.

By contrast, KGE currently solicits separate uploads for nodes and edges. The current KGE approach is in my experience totally unprecedented.

An upload approach that better reconciles KGE's design with current practice might look like this:

  1. Begin creation of a "file set" with the user uploading one multi-file archive.
  2. On the server, extract and save a file list from the archive, and save the File Set in a draft-like state.
  3. Have the user select or annotate any additional information necessary to create the File Set. For example, display the list of files in the archive and provide checkboxes to mark which files contain nodes, edges, or metadata.
  4. Once validations are implemented, perform them at this step.
  5. Collect the essentials of what the user did in step 3, along with what the server did in step 4, into a .json file, and add it to the archive under a semantically versioned filename such as KGE_v1.json.
  6. Recreate the archive with KGE_v1.json added. Use one preferred archive format such as .tar.gz or .tar.bz2 if it saves trouble. (Steps 5-7 are sketched in code after this list.)
  7. Take the sha1sum of the final archive.
  8. Publish the final archive with its sha1sum.
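
To make steps 5-7 concrete, here is a minimal sketch in Python, assuming the uploaded archive has already been extracted to extract_dir and the user's step-3 selections were collected into a dict. Apart from KGE_v1.json, the function and argument names are illustrative, not anything KGE actually implements.

```python
import hashlib
import json
import tarfile
from pathlib import Path


def finalize_file_set(extract_dir: Path, user_selections: dict, out_path: Path) -> str:
    """Write the manifest, rebuild one canonical .tar.gz, and return its sha1sum."""
    # Step 5: record the user's choices (which files hold nodes, edges, metadata)
    # in a semantically versioned manifest inside the extracted tree.
    (extract_dir / "KGE_v1.json").write_text(json.dumps(user_selections, indent=2))

    # Step 6: recreate the archive in a single preferred format (.tar.gz here).
    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(p for p in extract_dir.rglob("*") if p.is_file()):
            tar.add(path, arcname=str(path.relative_to(extract_dir)))

    # Step 7: take the sha1sum of the final archive, streaming so memory stays bounded.
    sha1 = hashlib.sha1()
    with open(out_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha1.update(chunk)
    return sha1.hexdigest()
```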

Now regarding sha1sum: our use case is that we would only ever download a whole File Set or none of it. Identifying the File Set with one sha1 and transferring all of its files together is therefore preferred. However, if some users would prefer to consume the files of a File Set piecemeal, I would recommend supporting them by modifying step 6 above: simply upload from the server to S3 a second set of final artifacts, in which each file in the File Set is compressed and sha1summed separately (sketched below).
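
A sketch of what that modified step 6 might look like, again with hypothetical names; the boto3 calls are only indicative of pushing the per-file artifacts to S3.

```python
import gzip
import hashlib
import shutil
from pathlib import Path

import boto3  # assumed to be how the server talks to S3


def publish_files_individually(extract_dir: Path, bucket: str, prefix: str) -> dict:
    """Gzip and sha1sum each file of the File Set separately, then upload it to S3."""
    s3 = boto3.client("s3")
    checksums = {}
    for path in sorted(p for p in extract_dir.rglob("*") if p.is_file()):
        # Compress each member on its own so subscribers can fetch files piecemeal.
        gz_path = path.parent / (path.name + ".gz")
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Each compressed member gets its own sha1sum.
        checksums[gz_path.name] = hashlib.sha1(gz_path.read_bytes()).hexdigest()
        s3.upload_file(str(gz_path), bucket, f"{prefix}/{gz_path.name}")
    return checksums
```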

In summary, compared to the status quo, problems solved by the upload approach above include:

  • We avoid asking the user to wait through more than one upload, which adds to attention disruption.
  • If a KG provider has previously exchanged KGX-compliant data in the way we have seen, they can get started with KGE by pressing an upload button right away, instead of first rearranging their archives.

Your mileage may vary. Thanks for listening.

@kennethbruskiewicz (Collaborator) commented Aug 3, 2021

KGE supports the upload of node files, edge files, or archives of files into a bucket. If flat files are given, then an archive is created on download to represent the whole contents of the graph. Otherwise, the archive that was given is the archive that is downloaded.
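
Schematically, that behavior is something like the following sketch (the file names are illustrative, and this is not the actual implementation):

```python
import tarfile
from pathlib import Path
from typing import Optional


def archive_for_download(file_set_dir: Path, uploaded_archive: Optional[Path], out_path: Path) -> Path:
    """Return the archive a subscriber downloads for one File Set."""
    if uploaded_archive is not None:
        # An archive was uploaded: it is the archive that gets downloaded.
        return uploaded_archive
    # Flat files were uploaded: bundle them into a single tar.gz on download.
    with tarfile.open(out_path, "w:gz") as tar:
        for name in ("nodes.tsv", "edges.tsv", "file_set.yaml", "provider.yaml"):
            member = file_set_dir / name
            if member.exists():
                tar.add(member, arcname=name)
    return out_path
```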

The concept of treating nodes and edges as separable things comes from KGX, which can validate the properties of nodes alone, of nodes and edges together, or of an archive containing nodes and edges. I'm happy to hear arguments that this case is not relevant or should be discarded.

We also ensure that the archive downloadable through KGE can be processed for validation by KGX with no intermediate processing required, in part because we might also want to own the validation of uploaded knowledge graphs.

So as long as the archive remains the unit of consumption, even if it's not the unit of upload, life is grand: it keeps the production of validation hashes simple and makes it something done only once.

@jeffhhk (Contributor, Author) commented Aug 4, 2021

@kbruskiewicz Somehow I received this comment by email, and it looks attached to #45, but it's not here anymore:

I'm interested in hearing if .tgz or .zip makes a difference to you. From the point of view of the sausage-maker, given the current setup I prefer .zip, however KGE is built with the assumption that it will give and take .tgz archives.

My response:

It sounds like you agree that the archive is the unit of consumption. In the KGE UI, there is a concept of a File Set. As long as KGE is going to be in the business of File Sets, it makes sense to me that any File Set should have a standard archive representation. The subscribing user should be insulated from whether the publishing user uploaded individual files or an archive, and from what the format of that archive was. The cost of KGE having to extract an uploaded archive is already sunk by the possibility of KGE doing validation and similar activities. So long as KGE has to extract an archive anyway, it may as well recompress it into a canonical form.
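
As an illustration only (not KGE's implementation), and assuming .zip and .tgz are the formats publishers might send, the canonicalization could look like this:

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path


def canonicalize_archive(uploaded: Path, canonical: Path) -> Path:
    """Extract whatever archive the publisher sent and repack it as one canonical .tar.gz."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        if zipfile.is_zipfile(uploaded):
            with zipfile.ZipFile(uploaded) as zf:
                zf.extractall(tmp_dir)
        else:
            # "r" mode lets tarfile auto-detect gzip or bzip2 compression.
            with tarfile.open(uploaded, "r") as tf:
                tf.extractall(tmp_dir)  # in production, member paths should be validated first
        with tarfile.open(canonical, "w:gz") as tar:
            tar.add(tmp_dir, arcname=".")
    return canonical
```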

As a subscribing user, I don't care much which canonical archive format is chosen, so long as there is only one. Pick one and stick with it. Make sure what's in the archive is not another archive. From a subscriber's point of view, the unit of cost is how many archive formats need to be supported.

If I were the author of KGE, I would say the top two considerations for the archive format are compression ratio and popularity. .zip may be the most popular. .tar.bz2 probably compresses better but is less popular. .7z might compress better still but is even less popular. IMO, compression and decompression speed should not be a main consideration for KGE, because downstream users who care can always rearchive. After converting to our app-specific index format, our team uses a combination of gzip, split, and tar designed to minimize the time required to download and assemble data in parallel from S3 over 10 Gb/s networking.
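
For what it's worth, that split-and-parallel-fetch pattern looks roughly like the sketch below. The bucket, key list, and part layout are hypothetical, and this is not our actual tooling, only an illustration of why downstream users can always rearchive to suit their own needs.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3  # assumed S3 client


def fetch_split_archive(bucket: str, part_keys: list, dest: Path, workers: int = 8) -> Path:
    """Download the parts of a split archive in parallel and stitch them back together in order."""
    s3 = boto3.client("s3")
    part_paths = [dest.parent / Path(key).name for key in part_keys]

    def download(key, path):
        s3.download_file(bucket, key, str(path))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each part is an independent object, so a fast link can be kept busy.
        list(pool.map(download, part_keys, part_paths))

    # Concatenating the parts in order restores the original file that was split.
    with open(dest, "wb") as out:
        for path in part_paths:
            out.write(path.read_bytes())
    return dest
```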

Here is a fairly well-thought-out list of tradeoffs from a Stack Exchange post:

https://superuser.com/questions/205223/pros-and-cons-of-bzip-vs-gzip

Gzip and bzip2 are functionally equivalent. (There once was a bzip, but it seems to have completely vanished off the face of the world.) Other common compression formats are zip, rar and 7z; these three do both compression and archiving (packing multiple files into one). Here are some typical ratings in terms of speed, availability and typical compression ratio (note that these ratings are somewhat subjective, don't take them as gospel):

decompression speed (fast > slow): gzip, zip > 7z > rar > bzip2
compression speed (fast > slow): gzip, zip > bzip2 > 7z > rar
compression ratio (better > worse): 7z > rar, bzip2 > gzip > zip
availability (unix): gzip > bzip2 > zip > 7z > rar
availability (windows): zip > rar > 7z > gzip, bzip2

As you can see, there isn't a clear winner.

@jeffhhk (Contributor, Author) commented Aug 5, 2021

I just noticed that download currently produces a .tar.gz. I think .tar.gz is a fine choice for download.

@jeffhhk (Contributor, Author) commented Sep 10, 2021

In the discussion around efficient browser upload (#53), the idea of uploading .tar.gz came up again, and @RichardBruskiewich mentioned that "I think some version of that works." I just tried uploading a .tgz file. Publishing reported success and allowed me to download a file with these contents:

ls -l 1.9
total 8
-rw-r--r-- 1 jeff jeff    0 Sep 10 10:37 edges.tsv
-rw-r--r-- 1 jeff jeff  483 Sep 10 10:37 file_set.yaml
-rw-r--r-- 1 jeff jeff    0 Sep 10 10:37 nodes.tsv
-rw-r--r-- 1 jeff jeff 1496 Jul 29 20:01 provider.yaml

While the contents of my .tgz file were two files named "nodes.tsv" and "edges.tsv", I'm guessing the empty files above are some kind of autogenerated artifact that would have been inserted regardless of the contents of my .tgz.

@kbruskiewicz do you know whether or not a flavor of multi-file upload works today? If so, how is it detected?

@RichardBruskiewich (Collaborator) commented

We appear to have "archive unpacking" on the server side, nodes/edges file aggregation, and tar.gz creation of the downloadable archive all working on the server. We can review the points raised above in this issue, verifying that we have all our bases covered.

@RichardBruskiewich added the "September 2021 Relay" label (bug fix / feature addition) on Sep 20, 2021