Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi-file write support to the js and python sdks #451

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

0div
Copy link
Contributor

@0div 0div commented Oct 4, 2024

Description

  • Allow the Filesystem.write method to accept multiple files
  • Add tests to assert that envd supports multipart with multiple files out of the box

Test

# Test js-sdk
cd packages/js-sdk
pnpm test

# Test python-sdk
cd packages/python-sdk
pnpm test

@0div 0div requested a review from ValentaTomas October 4, 2024 18:01
@0div 0div self-assigned this Oct 4, 2024
Copy link

linear bot commented Oct 4, 2024

Copy link

changeset-bot bot commented Oct 4, 2024

⚠️ No Changeset found

Latest commit: 86262f1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@0div 0div changed the title Added multi-file write support Add multi-file write support Oct 4, 2024
@0div 0div added the Improvement Improvement for current functionality label Oct 4, 2024
@0div 0div changed the title Add multi-file write support Add multi-file write support to the js-sdk Oct 4, 2024
@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 5, 2024

I pushed an edit that should sketch a type cleanup.

There are two unfinished parts:

  1. It should be possible to typeguard this so there is no type assumption when dealing with the parameters.
    const { path, writeOpts, writeFiles } = typeof pathOrFiles === 'string'
      ? { path: pathOrFiles, writeFiles: [{ data: dataOrOpts }], writeOpts: opts }
      : { path: undefined, writeFiles: pathOrFiles, writeOpts: dataOrOpts }

    const blobs = await Promise.all(writeFiles.map(f => new Response(f.data).blob()))
  1. The problem here was actually there even in the original SDK code — the EntryInfo type assumption works because the returned types are exactly of the shape of EntryInfo, but I think we should property type these by creating a mapping. Also, this will not correctly return the array if I just passed an array with one file.
    return files.length === 1 ? files[0] : files

Some edits I did:

  • Changed filename to path in the list of files to make it the same as the argument name in a method that takes only one
  • The method returned array for > 1, but if you passed an empty array as the files, it would throw an error. There is still the problem I mentioned in 2.

Also, your previous code was mostly correct; this is more of an improvement.

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 5, 2024

I also think the path might be redundant as an API argument here because we can just send the file with the filename as usual. (I read the response in the envd infra PR, so not that sure here.)

Let's keep it in the envd API though, because one of the uses was that you can use it to ensure that the path people upload files to is fixated.

@0div
Copy link
Contributor Author

0div commented Oct 7, 2024

@ValentaTomas I addressed your comments and added extra tests for some edge cases.

@0div
Copy link
Contributor Author

0div commented Oct 7, 2024

One important thing to note, and I've added it as a code comment, is that we can't expect specified directories in path of multipart filename to be taken into consideration; I've tested it with and sure enough only file name is used as path, the rest of the path is stripped by the std lib's "mime/multipart" when calling

pathToResolve = part.FileName()

@ValentaTomas
Copy link
Member

One important thing to note, and I've added it as a code comment, is that we can't expect specified directories in path of multipart filename to be taken into consideration; I've tested it with and sure enough only file name is used as path, the rest of the path is stripped by the std lib's "mime/multipart" when calling

pathToResolve = part.FileName()

Ok, this is a very good find — how do you think we should handle this? We want to be able to upload files with any path, but the stripping of paths might make sense to preserve, because it allows people to upload by link easily. This might require some changes to the envd.

@0div
Copy link
Contributor Author

0div commented Oct 7, 2024

One important thing to note, and I've added it as a code comment, is that we can't expect specified directories in path of multipart filename to be taken into consideration; I've tested it with and sure enough only file name is used as path, the rest of the path is stripped by the std lib's "mime/multipart" when calling

pathToResolve = part.FileName()

Ok, this is a very good find — how do you think we should handle this? We want to be able to upload files with any path, but the stripping of paths might make sense to preserve, because it allows people to upload by link easily. This might require some changes to the envd.

If it's really important not to break the current API spec, we could add a custom field to the multipart dataform and look for it with some changes to envd

@0div 0div changed the title Add multi-file write support to the js-sdk Add multi-file write support to the js and python sdks Oct 7, 2024
@0div
Copy link
Contributor Author

0div commented Oct 7, 2024

@ValentaTomas I started by adding multi file write support for sandbox_sync to get your opinion on this approach before moving forward with _async.

Considerations

I went the OR route when it comes to typing, as function overload is not natively supported in python. Thate being said, with some non-negligible overhead, it is kindof achievable, but im not enthralled by the idea, especially when thinking about previous work generating the api-ref automatically: i have a feeling it wouldn't play well with it.

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 7, 2024

For the overload I think you can use the same system as we already have with https://github.com/e2b-dev/E2B/blob/beta/packages/python-sdk/e2b/sandbox_sync/process/main.py#L106

It should be the same thing, right?

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 7, 2024

I also suggest naming and exporting all the types (from both SDKs). What do you say about having:

  • WriteData — this is then union type for the data (string, bytes, etc)
  • WriteEntry — this is the type that contains both the path and data: WriteData fields.

@ValentaTomas
Copy link
Member

I'm thinking about what to do when you try to invoke write for multiple files and provide an empty array.

Logically, you might want to notify the user that nothing was written, but throwing an error might not be optimal. If you are generating the field to write, you need to explicitly check if the array is empty; otherwise, you will get an error.

In contrast to this, isn't writing 0 files a valid operation and you will also get an array with 0 results so everything is ok?

@ValentaTomas
Copy link
Member

One important thing to note, and I've added it as a code comment, is that we can't expect specified directories in path of multipart filename to be taken into consideration; I've tested it with and sure enough only file name is used as path, the rest of the path is stripped by the std lib's "mime/multipart" when calling

pathToResolve = part.FileName()

Ok, this is a very good find — how do you think we should handle this? We want to be able to upload files with any path, but the stripping of paths might make sense to preserve, because it allows people to upload by link easily. This might require some changes to the envd.

If it's really important not to break the current API spec, we could add a custom field to the multipart dataform and look for it with some changes to envd

Yeah, I'm thinking that we should probably do this, because people are already using the Beta SDK.

@0div
Copy link
Contributor Author

0div commented Oct 8, 2024

I'm thinking about what to do when you try to invoke write for multiple files and provide an empty array.

Logically, you might want to notify the user that nothing was written, but throwing an error might not be optimal. If you are generating the field to write, you need to explicitly check if the array is empty; otherwise, you will get an error.

In contrast to this, isn't writing 0 files a valid operation and you will also get an array with 0 results so everything is ok?

From the perpective that there will likely be less control over which files, if any, are generated, your point makes sense. I will allow empty arrays

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

I usually don't like the passing arrays.

When you pass an array the function will return an array and will it have only one element? should I check it has the element or should I risk index out of bounds error?

That's a good argument. I agree. I don't like dealing with arrays and checking the length as well

@ValentaTomas
Copy link
Member

I also like the single upload method because it is something users already use a lot; requiring to always pass an array didn't sit well with me.

For splitting the methods — I really don't want to end up with more methods for the same thing on the same level.

export type WriteEntry = {
path: string
data: string | ArrayBuffer | Blob | ReadableStream
}

I think this is the better solution here than the WriteData I originally proposed.

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

I usually don't like the passing arrays.
When you pass an array the function will return an array and will it have only one element? should I check it has the element or should I risk index out of bounds error?

That's a good argument. I agree. I don't like dealing with arrays and checking the length as well

For the multifile upload, you have to pass an array though, right?

EDIT: If I understand it correctly, this is an argument for having the single file upload there, right

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

I think this is the better solution here than the WriteData I originally proposed.

It would be great if we adopt this approach in general. With Typescript it's a thin line between using types to hide some abstraction and over-using types.

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

For splitting the methods — I really don't want to end up with more methods for the same thing on the same level.

I can maybe see us using the overrides in JS but I really don't like this write signature in Python:

path_or_files: str | List[WriteEntry],
data_or_user: WriteData | Username = "user",
user_or_request_timeout: Optional[float | Username] = None,

It's not clear at all how to use these methods just from looking at the code. We should really aim for trying to be as clear as possible just from the code.

It creates many questions in my had, why is data and user mixed together? Why is user and request timeout mixed together? It's very unintuitive

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

Just to clear up a possible confusion here, because when I checked the start of the discussion it might not be clear:

        path_or_files: str | List[WriteEntry],
        data_or_user: WriteData | Username = "user",
        user_or_request_timeout: Optional[float | Username] = None,

This signature should never be seen by the user, because the method is @overloaded, so the visible method signatures will be:

  1. This will be default and selected first if you start typing:
    @overload
    def write(
        self,
        path: str,
        data: WriteData,
        user: Username = "user",
        request_timeout: Optional[float] = None,
    ) -> EntryInfo:
  1. You can switch to this one explicitly or implicitly by passing an array (because of duck typing) as first argument or files argument.
    @overload
    def write(
        self,
        files: List[WriteEntry],
        user: Optional[Username] = "user",
        request_timeout: Optional[float] = None,
    ) -> List[EntryInfo]:

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

Just to clear up a possible confusion here, because of the start of the discussion:

        path_or_files: str | List[WriteEntry],
        data_or_user: WriteData | Username = "user",
        user_or_request_timeout: Optional[float | Username] = None,

This signature should never be seen by the user, because the method is @overloaded, so the visible method signatures will be:

  1. This will be default and selected first if you start typing:
    @overload
    def write(
        self,
        path: str,
        data: WriteData,
        user: Username = "user",
        request_timeout: Optional[float] = None,
    ) -> EntryInfo:
  1. You can switch to this one explicitly or implicitly by passing an array (because of duck typing) as first argument or files argument.
    @overload
    def write(
        self,
        files: List[WriteEntry],
        user: Optional[Username] = "user",
        request_timeout: Optional[float] = None,
    ) -> List[EntryInfo]:

I see. I must have missed the @overload, sorry. What does the user option do? Is that new?

I'll leave @jakubno to add any notes if has anything additional. As I said - he has the most Python experience

@jakubno
Copy link
Member

jakubno commented Oct 10, 2024

This signature should never be seen by the user, because the method is @OverLoaded, so the visible method signatures will be:

It ain't true, users can see the source code for python libraries and I often go there

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

The user is an optional argument that can be used to modify who you are making the filesystem operation as — if you need to delete a file owned by root or create a file that is owned by a different user that the default one.

We had a lot of confusion around the users/permissions, so having this explicitly visible and modifiable seems better than hiding it again.

EDIT: It is new in a sense that it is on all relevant methods in the Beta SDK.

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

This signature should never be seen by the user, because the method is @OverLoaded, so the visible method signatures will be:

It ain't true, users can see the source code for python libraries and I often go there

Well, I cannot argue with that. At that point isn't the usage clear?

EDIT: What I wanted to say — are you worried about the overload implementation being confusing to people inspecting the code?

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

The user is an optional argument that can be used to modify who you are making the filesystem operation as — if you need to delete a file owned by root or create a file that is owned by a different user that the default one.

We had a lot of confusion around the users/permissions, so having this explicitly visible and modifiable seems better than hiding it again.

Is this only filesystem thing? Or on all methods now? My immediate thought would be to expect something like this on every relevant method.

We had a lot of confusion around the users/permissions, so having this explicitly visible and modifiable seems better than hiding it again.

I'm pretty sure this won't solve it. People will be still confused and won't know what to pass there. The solution is that users shouldn't need to mess with permissions by default. The filesystem should be accessible to the default user. I had very relevant feedback for this from our hackathon:

Problem uploading data to sandbox and not being able to open the file because of permissions. Had to use sudo. Everything was in /home/user.

@0div
Copy link
Contributor Author

0div commented Oct 10, 2024

Would the pythonic *args (Non-Keyword Arguments) be a good middle-ground here?
And its equivalent in TS function sum(...args) {

That way a dev can use one or more args or w/o using arrays.

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

Would the pythonic *args (Non-Keyword Arguments) be a good middle-ground here? And its equivalent in TS function sum(...args) {

That way a dev can use one or more args or w/o using arrays.

I think it would be slightly problematic (more so in TS) because we also want to pass the opts for the request.

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

plus don't we lose all the type hints?

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

Is this only filesystem thing? Or on all methods now? My immediate thought would be to expect something like this on every relevant method.

Yes, it is on all relevant methods.

By default you don't need to mess with this, but when interacting with filesystem, not having this exposed would lead you to not being able to do this. So that's why I choose to do less magic here — the user you pass is used for the operation, there is default of user that is being user, but you can override it.

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

Is this only filesystem thing? Or on all methods now? My immediate thought would be to expect something like this on every relevant method.
Yes, it is on all relevant methods.

By default you don't need to mess with this, but when interacting with filesystem, not having this exposed would lead you to not being able to do this. So that's why I choose to do less magic here — the user you pass is used for the operation, there is default of user that is being user, but you can override it.

What are the use cases when you need it? It's a simple parameter but it opens a whole can of new questions like "how do I create a new user?" or "how do I switch to a different user when running a command?" that we'll need to provide answers for. That's why I'm not a big fan of this. It feels only half way there because it's missing all the technical content and utilities around it.

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

plus don't we lose all the type hints?

The variadic argument could be typed, but generally, it should be the last argument. Otherwise, it won't work.

@mishushakov
Copy link
Member

Sorry to get in between, but for the Browserbase SDK I was in a similar situation, we ended up with a load_url method to load a single web page, load_urls to load multiple and a load method to handle both (depending on input)

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

Sorry to get in between, but for the Browserbase SDK I was in a similar situation, we ended up with a load_url method to load a single web page, load_urls to load multiple and a load method to handle both (depending on input)

Thanks for the input. Would you do it the same way again?

I think I'm leaning increasingly more towards two separate methods. I'm not sure about having a third one to potentially handle both single and multi use case.

@mishushakov
Copy link
Member

mishushakov commented Oct 10, 2024

Would do two methods. My thinking was (still is): I want a function to return X. If a function should return Y, I make a function that returns Y. A function that returns both X and Y is flaky

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

@mishushakov Why you ended up with three in the end?

@mishushakov
Copy link
Member

Because it felt like if we have the other two why not have it also. But tbh., the whole thing was out of necessity, because I didn't want to start a browser every time I want to fetch a URL, load_urls would connect once and do the same thing as load for multiple urls. I actually prefer the Node way - you can files.map > fs.writeFile > Promise.all(files)

@ValentaTomas
Copy link
Member

Three sounds the worst :D

Let's do the two separate if everybody agrees:

  • write_file/writeFile
  • write_files/writeFiles

@mlejva
Copy link
Member

mlejva commented Oct 10, 2024

Three sounds the worst :D

Let's do the two separate if everybody agrees:

  • write_file/writeFile
  • write_files/writeFiles

I agree about separate methods but not sure about the naming because there's already files module:

sandbox.files.writeFile() / sandbox.files.write_file()
sandbox.files.write_files() / sandbox.files.write_files()

Maybe feels a little weird?

@ValentaTomas
Copy link
Member

ValentaTomas commented Oct 10, 2024

Alternative namespace solutions:

  • sandbox.fs.write_file (Not sure if it is clear that you should use this for filesystem)
  • sandbox.filesystem.write_file (This seems too long)
  • sandbox.write_file (Can we get rid of all namespaces without making everything confusing?)

EDIT: I actually feel that maybe not having namespaces lowers the mental difficulty for me when using the SDK — you just type what you want to do as opposed to having to think in which namespace the method should be.

EDIT 2: Also I would not even argue for overloaded write method without the files namespace, because without namespaces you need to name it more specifically like writeFile which tells you that it takes just one file.

@0div
Copy link
Contributor Author

0div commented Oct 10, 2024

Alternative namespace solutions:

  • sandbox.fs.write_file (Not sure if it is clear that you should use this for filesystem)
  • sandbox.filesystem.write_file (This seems too long)
  • sandbox.write_file (Can we get rid of all namespaces without making everything confusing?)

EDIT: I actually feel that maybe not having namespaces lowers the mental difficulty for me when using the SDK — you just type what you want to do as opposed to having to think in which namespace the method should be.

EDIT 2: Also I would not even argue for overloaded write method without the files namespace, because without namespaces you need to name it more specifically like writeFile which tells you that it takes just one file.

Keeping a namespace or prefixing the method name could help disambiguate a bit since that kinduv io touches other linux abstractions.

@mishushakov
Copy link
Member

I'm going with no namespaces. With Stripe SDK it makes sense to have namespaces as they have different "products" and the products have different methods available. We only have one "product" (or class) which is the sandbox, so we should only have sandbox methods on top-level, eg: sandbox.writeFile.

@0div
Copy link
Contributor Author

0div commented Oct 11, 2024

I'm going with no namespaces. With Stripe SDK it makes sense to have namespaces as they have different "products" and the products have different methods available. We only have one "product" (or class) which is the sandbox, so we should only have sandbox methods on top-level, eg: sandbox.writeFile.

The question is then is this a single product API or a VM (with its components) API?

Base automatically changed from beta to main October 16, 2024 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Improvement Improvement for current functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants