Docs/perf best practices update #7586
Conversation
Thanks for the awesome contribution! Our performance guidelines definitely have room for improvement. A few comments:
- I don't think the Fuse examples are relevant in this doc.
- Should import really go under "Operate directly on the storage"? I'll let @talSofer decide.
> lakeFS offers multiple ways to do that.
>
> ### Use zero-copy import
I'm not sure that this is the appropriate location for this section. For example, I could import data to lakeFS and then read it directly from lakeFS. So that flow isn't "operating directly on the object store".
+1, The zero-copy import section talks about how to regularly feed data into lakeFS rather than how to interact with data already managed by lakeFS.
How would you prefer to organize this information?
The existing flow makes sense to me but I'm not dead set on it.
The reason it makes sense to me is that when I looked at the lakeFS architecture diagram, the first question in my mind was "how do I bypass the lakeFS service for bulk operations on data?", since that's the obvious bottleneck assuming the object storage is S3 or similar. This leads to three questions, which are addressed in this section:
- If I already have data in S3, how do I make lakeFS aware of it without re-copying? Answer: use zero-copy writes/imports. (This is where DVC fell out of our evaluation, by the way.)
- If I already have a large dataset in lakeFS, how do I add new data? Two answers: another zero-copy import is fine, or write by performing metadata operations in lakeFS and writing directly to storage.
- How do I do efficient, scalable reads? Answer: get the URLs from the metadata service, then talk to S3/GCS directly using those URLs.

And the Fuse material got in here because the follow-up to the third question is: what if I am not writing my own reader, but rather using FUSE to mount a bucket? Does that mean I'm SOL for using lakeFS at all, and if not, do all the reads go through the slow way, streaming data through the lakeFS service? The answer to that is also no: lakeFS thought of that, it's all done via symlinks, and you can read directly from the storage.
> * Read an object with getObject (lakeFS OpenAPI) and add `--presign`. You'll get a link to download the object.
> * Use statObject (lakeFS OpenAPI), which will return a `physical_address`: the actual S3/GCS path that you can read with any S3/GCS client.
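The statObject flow quoted above can be sketched roughly as follows. This is an illustrative sketch only: the endpoint shape follows the lakeFS OpenAPI description of statObject, but `LAKEFS_URL`, the repository/ref names, and the sample response values are placeholders, not real data.

```python
# Sketch: resolve a lakeFS logical path to its physical object-store address,
# then read the bytes with a plain S3/GCS client instead of streaming them
# through the lakeFS service. LAKEFS_URL and all names are placeholders.
from urllib.parse import quote

LAKEFS_URL = "https://lakefs.example.com/api/v1"  # placeholder server address

def stat_object_url(repo: str, ref: str, path: str) -> str:
    """Build the statObject request URL (path is URL-encoded as a query param)."""
    return (f"{LAKEFS_URL}/repositories/{repo}/refs/{ref}"
            f"/objects/stat?path={quote(path, safe='')}")

def physical_address(stat_response: dict) -> str:
    """Pull the backing-store path out of a statObject response body."""
    return stat_response["physical_address"]

# Fabricated response for illustration; a real call would be something like
# requests.get(stat_object_url(...), auth=...).json()
sample = {"path": "data/part-0.parquet",
          "physical_address": "s3://my-bucket/data/abc123"}
print(physical_address(sample))  # an S3 path readable with any S3 client
```

From here, the returned address can be handed to any S3/GCS client for the actual read, so the bulk data never flows through the lakeFS service.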
> ### Reading directly from GCS when using GCS-Fuse
I believe this belongs in a dedicated GCS-Fuse page, not in a general best performance practices guide.
I adjusted the wording and section header level a bit to clarify the relevance, but I do think it's worth calling out here, because "how the heck will this work scalably, if at all?" is a performance question, and it doesn't hurt to answer it here and point at the GCS-Fuse page for details.
The GCS-Fuse page should probably live under integrations/gcs-fuse, and perhaps also be linked from deploy/gcp.md, not inside Vertex, as it's not just Vertex that uses it. But I'll save that for a separate pull request :)
Also while we are on the topic, this is a very clever way to integrate with Fuse, kudos.
Thank you @dvryaboy for your valued contribution!
I added some comments and agreed with @itaiad200's suggestions to
- Relocate the GCS-Fuse docs to a dedicated page. They are already at https://docs.lakefs.io/integrations/vertex_ai.html#using-lakefs-with-cloud-storage-fuse. We will be happy to see contributions/changes to that page if you are interested!
- Keep the use zero-copy import section in its original place.
We will be happy to get your thoughts on this :)
> This is achieved by `lakectl fs upload --pre-sign` (Docs: [lakectl-upload][lakectl-upload]). The equivalent OpenAPI endpoint will return a URL to which the user can upload the file(s) in question.
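The pre-signed upload flow described above is a two-step exchange: ask lakeFS for an upload URL, then PUT the bytes straight to the object store. The sketch below is illustrative only; the response field names (`presigned_url`, `physical_address`) and the helper names are assumptions, not the exact lakeFS API.

```python
# Sketch of a pre-signed upload: lakeFS hands back an upload URL, the client
# PUTs the bytes directly to the object store, and lakeFS records only
# metadata. All field names below are illustrative assumptions.

def parse_presign_response(resp: dict) -> tuple:
    """Split a hypothetical presign response into (upload URL, physical address)."""
    return resp["presigned_url"], resp["physical_address"]

def upload_plan(resp: dict, data: bytes) -> dict:
    """Describe the direct-to-storage PUT a client would perform next,
    e.g. requests.put(url, data=data) against the pre-signed URL."""
    url, addr = parse_presign_response(resp)
    return {"method": "PUT", "url": url, "bytes": len(data), "links_to": addr}

sample = {"presigned_url": "https://s3.example.com/bucket/obj?X-Amz-Signature=abc",
          "physical_address": "s3://bucket/obj"}
plan = upload_plan(sample, b"hello")
print(plan["method"], plan["bytes"], plan["links_to"])
```

The key point is that the data payload goes over the pre-signed URL to the storage itself; only the small metadata exchange touches the lakeFS service.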
> ### Read directly from the object store
>
> lakeFS maintains versions by keeping track of each file; the commit and branch paths (`my-repo/commits/{commit_id}`, `my-repo/branches/main`) are virtual and resolved by the lakeFS service to iterate through the appropriate files.
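The virtual-path resolution described in the quoted text can be pictured with a toy model. This is purely illustrative (invented commit IDs and object names), not the real lakeFS metadata format:

```python
# Toy model of lakeFS path resolution: a branch name resolves to a head
# commit, and each commit maps logical file paths to physical storage paths.
branches = {"main": "c2"}                        # branch -> head commit id
commits = {
    "c1": {"data/a.csv": "s3://bucket/obj-111"},
    "c2": {"data/a.csv": "s3://bucket/obj-111",  # unchanged file: same object
           "data/b.csv": "s3://bucket/obj-222"},
}

def resolve(ref: str, path: str) -> str:
    """Resolve lakefs://repo/<ref>/<path> to its physical object address."""
    commit_id = branches.get(ref, ref)  # ref is a branch name or a commit id
    return commits[commit_id][path]

print(resolve("main", "data/b.csv"))  # s3://bucket/obj-222
print(resolve("c1", "data/a.csv"))    # s3://bucket/obj-111
```

Note how the unchanged file resolves to the same physical object in both commits; versioning is metadata, not copies.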
I believe that this part belongs to the concepts and model page.
I think a version of this is in the concepts and model page ("A lot of what lakeFS does is to manage how lakeFS paths translate to physical paths on the object store.", etc).
Do you feel this sort of thing needs to be DRYed up? My thought was that "repetition doesn't spoil the prayer", as the saying goes, and a brief mention here helps set the context without expecting the reader to have gone through the other pages in detail.
> @@ -31,9 +26,61 @@
> When accessing data using the branch name (e.g. `lakefs://repo/main/path`) lakeF…
>
> For more information, see [how uncommitted data is managed in lakeFS][representing-refs-and-uncommitted-metadata].
>
> ## Operate directly on the storage
If we want to break this section down into sub-sections, I'd suggest that each sub-section describe one way to operate directly on the storage, rather than distinguishing between reads and writes, since most of the ways support both. That is, I'd use a structure like:
- Operate directly on the storage
  - Pre-sign URLs
  - lakeFS Hadoop Filesystem
  - Staging API

WDYT?
I think that can work, but I can't personally commit to it, as I don't quite grok the details of the staging API and the lakeFS HDFS setup (sadly, I know more than I would like to about HDFS itself :) ).
Is this something you'd like to address in this PR, or could it be addressed in a subsequent one?
@talSofer and @itaiad200, thank you for the feedback!
This PR is now marked as stale after 30 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.
Closing this PR because it has been stale for 7 days with no activity.
This PR is now marked as stale after 30 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.
Closes #(Insert issue number closed by this PR)
Change Description
Adds more detail to the performance tips regarding reading directly from the backing object store, based on a conversation in Slack [1][2].