Support upload of multi-file archives #45
KGE supports the upload of node files, edge files, or archives of files into a bucket. If flat files are given, then an archive is created to represent the whole contents of the graph upon download; otherwise, the uploaded archive is the archive downloaded. The concept of making nodes and edges separable things is relative to KGX, which can validate the properties of nodes alone, the properties of nodes and edges, or the properties of an archive containing nodes and edges. I'm happy to hear disputes as to the relevance of this case, or whether it should be discarded.

We also ensure that the archive downloadable through KGE can be processed for validation by KGX with no intermediate processing required, in part because we might also want to own validating uploaded knowledge graphs. So as long as the archive remains the unit of consumption, even if it's not the unit of upload, life is grand, since it keeps the production of validation hashes simple, and something done only once.
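If the archive is the unit of consumption, producing the validation hash reduces to hashing one file. A minimal sketch, assuming sha1 over the downloadable archive (the function name and chunked-read approach are illustrative, not KGE's actual code):

```python
import hashlib


def archive_sha1(archive_path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the validation hash of the downloadable archive in one pass,
    reading in chunks so large archives don't need to fit in memory."""
    sha1 = hashlib.sha1()
    with open(archive_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()
```

Because the hash is taken over the single canonical archive, it only ever has to be computed once, at publish time.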
@kbruskiewicz Somehow I got this comment in my email that looks attached to #45, but it's not here anymore:
My response: It sounds like you agree that the archive is the unit of consumption. In the KGE UI, there is a concept of a File Set. As long as KGE is going to be in the business of File Sets, it makes sense to me that any File Set should have a standard archive representation. The subscribing user should be insulated from whether the publishing user uploaded individual files or an archive, and from what the format of that archive was. The cost of KGE extracting an archive from a publishing user is already sunk by the possibility of KGE doing validation and similar activities. So long as KGE has to extract an archive anyway, it may as well recompress it into a canonical form.

As a subscribing user, I don't care much what the canonical archive format is, so long as there is only one. Pick one and stick with it. Make sure what's in the archive is not another archive. From a subscriber's point of view, the unit of cost is how many archive formats need to be supported. If I were the author of KGE, the top two considerations for an archive format would be compression ratio and popularity. .zip may be the most popular. .tar.bz2 is probably better compressed but less popular. .7z might be better compressed still, but even less popular. IMO, compression and decompression speed should not be a main consideration for KGE, because downstream users who care can always rearchive. After converting to our app-specific index format, our team uses a combination of .gz, split, and tar designed to minimize the time required to download and assemble data in parallel from S3 over 10GBs networking. Here is a fairly well thought out list of tradeoffs from a StackExchange post:
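The extract-then-recompress step described above could be sketched as follows. `repack_to_canonical` is a hypothetical helper, not KGE's actual API; it assumes the upload's extension is one of the formats `shutil.unpack_archive` recognizes (.zip, .tar.gz, .tar.bz2, etc.), and it picks .tar.gz as the canonical form:

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def repack_to_canonical(upload_path: str, dest_path: str) -> None:
    """Extract any supported archive and re-emit it as a canonical .tar.gz."""
    with tempfile.TemporaryDirectory() as workdir:
        # shutil.unpack_archive dispatches on the file extension.
        shutil.unpack_archive(upload_path, workdir)
        with tarfile.open(dest_path, "w:gz") as tar:
            for member in sorted(Path(workdir).rglob("*")):
                # Store files by name only, flattening any directory nesting,
                # so the canonical archive never contains another archive's
                # internal layout.
                if member.is_file():
                    tar.add(member, arcname=member.name)
```

Whatever the publisher uploaded, the subscriber then only ever sees one format.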
I just noticed that download currently produces a .tar.gz. I think .tar.gz is a fine choice for download.
In the discussion around efficient browser upload #53, the idea of uploading .tar.gz came up again, and @RichardBruskiewich mentioned that "I think some version of that works." I just tried uploading a file named .tgz. Publishing reported success, and allowed me to download a file with these contents:
While the contents of my .tgz file were two files named "nodes.tsv" and "edges.tsv", I'm guessing the empties above are some kind of autogenerated artifact that would have been inserted regardless of the contents of my .tgz. @kbruskiewicz, do you know whether or not some flavor of multi-file upload works today? If so, how is it detected?
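On the detection question: one common approach, sketched here as an assumption rather than a description of what KGE actually does, is to sniff the upload's leading magic bytes instead of trusting the file extension:

```python
from typing import Optional


def sniff_archive(path: str) -> Optional[str]:
    """Guess the archive/compression type from magic bytes.

    Returns None when the file matches no known signature, i.e. it should
    be treated as a flat (e.g. TSV) file rather than unpacked.
    """
    with open(path, "rb") as f:
        head = f.read(4)
    if head[:2] == b"\x1f\x8b":    # gzip magic: covers .gz, .tgz, .tar.gz
        return "gzip"
    if head[:4] == b"PK\x03\x04":  # zip local-file header
        return "zip"
    if head[:3] == b"BZh":         # bzip2 magic: covers .bz2, .tar.bz2
        return "bzip2"
    return None
```

Extension-based detection (e.g. checking for ".tgz") is simpler but breaks on misnamed uploads, which matters if publishers upload through a browser form.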
We appear to have "archive unpacking" on the server side, then nodes/edges file aggregation.
@kbruskiewicz Again, great question on #35. I have posted my narrow answer to the question there as best I could.
In writing that response, I realized that my mental model of how KGE ought to work probably ought to be shared. Among other things, sharing my mental model lets me give a better answer to the question of what gets a sha1sum.
I know the rest of this response is more than you asked about. I promise to get to the specifics of what gets a sha1sum. Thanks in advance for bearing with me...
Let me step back by saying that, in my recollection, no team has ever communicated with the Unsecret Agent team by exchanging separate files for nodes and edges. We have only dealt in the currency of multi-file archives such as .tgz or .zip.
By contrast, KGE currently solicits separate uploads for nodes and edges. The current KGE approach is in my experience totally unprecedented.
An upload approach that better reconciles KGE's needs with current practice might look like this:
Now, regarding sha1sum: our use case is that we would only ever download a whole File Set or none of it. Identifying the File Set with one sha1 and transferring all the files in the set is preferred. However, if some users would prefer to consume the files of a File Set piecemeal, I would recommend supporting them by modifying step 6 above: simply upload from the server to S3 a second set of final uploads, in which each file in the File Set is compressed and sha1summed separately.
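That per-file variant of step 6 might be sketched like this. `compress_and_hash` is a hypothetical helper, not current KGE code, and pinning the gzip `mtime` to zero is my own addition so that re-uploading identical content yields an identical per-file sha1:

```python
import gzip
import hashlib
from pathlib import Path


def compress_and_hash(file_path: str, out_dir: str) -> tuple:
    """Gzip one File Set member; return (compressed_path, sha1_of_gz)."""
    src = Path(file_path)
    dest = Path(out_dir) / (src.name + ".gz")
    # mtime=0 pins the timestamp in the gzip header, making the compressed
    # bytes (and therefore the sha1) reproducible for identical input.
    with open(src, "rb") as fin, \
            gzip.GzipFile(filename=str(dest), mode="wb", mtime=0) as fout:
        for chunk in iter(lambda: fin.read(1 << 16), b""):
            fout.write(chunk)
    return str(dest), hashlib.sha1(dest.read_bytes()).hexdigest()
```

Each (path, sha1) pair would then be uploaded to S3 alongside the whole-set archive, so piecemeal consumers can verify individual files without downloading the full File Set.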
In summary, compared to the status quo, problems solved by the upload approach above include:
Your mileage may vary. Thanks for listening.