KGE files / archives should have md5 and/or sha256 hashes generated and available for download #35

Closed
RichardBruskiewich opened this issue May 25, 2021 · 7 comments

@RichardBruskiewich
Collaborator

... generated in the post-processing step after data set uploads

@jeffhhk
Contributor

jeffhhk commented Jul 20, 2021

The one hash function currently in use by the Unsecret Agent team is sha1. E.g.:

sha1sum ~/Downloads/semmed.data.zip
e7276d5afac1d13b2909a05618ca14fc07f88c95  semmed.data.zip

@RichardBruskiewich
Collaborator Author

What's entailed here is to compute the sha1sum of the file in the client browser before the upload, upload the hash along with the file, and then have the server recheck the uploaded data against it. The sha1sum "file" should be added to the KGE file archive.

Any archive created on the server side, for downloading, would also have a sha1sum computed and available for independent downloading by the UI (and/or CLI and/or program library).
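
A minimal sketch of the hash-and-recheck step described above, in Python. The function names (`sha1_of_file`, `write_sha1_sidecar`, `verify_sha1_sidecar`) and the `.sha1` sidecar suffix are assumptions for illustration, not part of the KGE Archive code. The sidecar uses the same `<hash>  <filename>` line format that `sha1sum` prints.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sha1_sidecar(path):
    """Write '<hash>  <filename>' next to the file (hypothetical '.sha1' suffix)."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".sha1")
    sidecar.write_text(f"{sha1_of_file(path)}  {path.name}\n")
    return sidecar


def verify_sha1_sidecar(path):
    """Recompute the hash and compare it with the one recorded in the sidecar."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".sha1")
    recorded = sidecar.read_text().split()[0]
    return recorded == sha1_of_file(path)
```

The same three functions would apply to an archive generated on the server: write the sidecar once when the archive is created, and let any downloader call the verify step after fetching both files.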

@kennethbruskiewicz
Collaborator

kennethbruskiewicz commented Jul 31, 2021

> The one hash function currently in use by the Unsecret Agent team is sha1. E.g.:
>
> sha1sum ~/Downloads/semmed.data.zip
> e7276d5afac1d13b2909a05618ca14fc07f88c95  semmed.data.zip

Hi @jeffhhk, I'd just like to clarify something. In your mind, does semmed.data.zip include both the nodes and the edges you use in your reasoner? In other words, with the hash, are you tracking the uniqueness of the knowledge graph as a whole?

@jeffhhk
Contributor

jeffhhk commented Aug 2, 2021

@RichardBruskiewich

> compute the sha1sum on the file in the client browser before the upload

Hash before the upload? What would be the benefit? Hashing before upload would compound the significant performance problems in the upload implementation. It would also close off the possibility of labeling the upload with extra information.

@jeffhhk
Contributor

jeffhhk commented Aug 2, 2021

@kbruskiewicz Great question. The purpose of the sha1 hash is to track the identity of a particular incarnation of a particular knowledge graph. Thus, if we observe an artifact with a certain sha1 in our system, and we see the same sha1 in KGE, then we know (with very high probability) that we do not have to download or reprocess that artifact.

The only thing our system knows how to process is a whole knowledge graph. We do not have a use case for processing one file of a File Set.
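
The skip check described above could look roughly like the following Python sketch. `needs_download`, `known_hashes`, and `published_sha1` are hypothetical names, and how a client would obtain the published hash from KGE is not specified in this thread.

```python
def needs_download(published_sha1, known_hashes):
    """Return True if no already-processed artifact matches the published SHA-1."""
    normalized = {h.strip().lower() for h in known_hashes}
    return published_sha1.strip().lower() not in normalized


# Hashes of whole knowledge graph archives this system has already processed.
known_hashes = {"e7276d5afac1d13b2909a05618ca14fc07f88c95"}

# Same hash as a processed artifact: no download or reprocessing needed.
print(needs_download("e7276d5afac1d13b2909a05618ca14fc07f88c95", known_hashes))

# A different hash indicates a different incarnation of the knowledge graph.
print(needs_download("a9993e364706816aba3e25717850c26c9cd0d89d", known_hashes))
```

Because the comparison is on the hash of the whole archive, this only answers "have we seen this exact knowledge graph before?" and says nothing about individual files in a File Set.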

@kennethbruskiewicz
Collaborator

kennethbruskiewicz commented Aug 3, 2021

> @RichardBruskiewich
>
> > compute the sha1sum on the file in the client browser before the upload
>
> Hash before the upload? What would be the benefit? Hashing before upload would compound the significant performance problems in the upload implementation. It would also close off the possibility of labeling the upload with extra information.

Richard is referring to some spit-balling we did when we were first thinking through the issue.

I asked the question about handling the archive vs. handling files in the archive because it affects my implementation strategy. Our current plan is to hash server-side, at the point where an archive is generated. I will continue any broader thoughts in #45.

@RichardBruskiewich
Collaborator Author

Done!
